Presentation is loading. Please wait.

Presentation is loading. Please wait.

FlashExtract : A General Framework for Data Extraction by Examples Vu Le (UC Davis) Sumit Gulwani (MSR)

Similar presentations


Presentation on theme: "FlashExtract : A General Framework for Data Extraction by Examples Vu Le (UC Davis) Sumit Gulwani (MSR)"— Presentation transcript:

1 FlashExtract : A General Framework for Data Extraction by Examples Vu Le (UC Davis) Sumit Gulwani (MSR)

2 motivation..…

3 demo

4 schema extraction program o Output schema o Field extraction programs for all fields in the schema

5 output schema o XML-like: sequence and structure Seq ( [blue] Struct(Name: [green] String, City: [yellow] String) )

6 field extraction program o An ancestor o A program in the DSL Examples o Green = o Yellow =

7 data extraction DSL o DSL is a tuple (G, N 1, N 2 ) o G : grammar defining extraction strategies o N 1 : top-level SeqRegion nonterminal o N 2 : top-level Region nonterminal o Each non-terminal has a learn method

8 core algebra o Decomposable Map Operator o Filter Operators o Merge Operator o Pair Operator

9 city example

10 1. Filter lines that end with “WA”

11 city example 1. Filter lines that end with “WA” 2. Map each selected line to a pair of positions

12 city example 1. Filter lines that end with “WA” 2. Map each selected line to a pair of positions 3. Learn two leaf exprs for the two positions

13 learning algorithm o Inductive on the grammar structure o Learn city = learn a map operator o The lines that hold the city o The pair that identifies the city within a line

14 learning algorithm o Inductive on the grammar structure o Learn city = learn a map operator o The lines that hold the city o The pair that identifies the city within a line o Learn lines = learn a Boolean filter

15 inductive synthesis 1. Problem Definition: Identify a vertical domain of tasks that users struggle with 2. Domain-Specific Language (DSL): Design a DSL that can succinctly describe tasks in that domain 3. Synthesis Algorithm: Develop an algorithm that can efficiently translate examples into likely programs in DSL 4. Machine Learning: Rank the various programs 5. User Interface: Provide an appropriate interaction mechanism to resolve ambiguities

16 pros & cons o Advantages o Efficient synthesizer o Easier ranking control o Tighter integration with user interaction model o Disadvantages o Non-constructive: require thinking & implementation o Non-modular: DSL is not extensible

17 inductive meta- synthesis o A synthesizer for a related family of DSLs that supports a common user interaction model o Alleviate disadvantages of the generic methodology

18 inductive meta- synthesis o Identify a family of vertical task domains o Design an algebra for DSLs o Implement a search algorithm for each algebra operator

19 inductive meta- synthesis o Identify a family of vertical task domains o Design an algebra for DSLs o Implement a search algorithm for each algebra operator Test-Driven Synthesis (Perelman et. al.) Synthesis Wed

20 extraction meta- synthesis o Identify a family of vertical task domains o Extraction of semi-structured documents o Design an algebra for DSLs o Merge, Map, FilterBool, FilterInt, Pair o Implement a search algorithm for each algebra operator o Compositional and inductive learners

21 synthesis algorithm o Top-down o Top-level SeqRegion, Region symbols N 1, N 2 o Grammar-guided o Grammar built from the algebra operators

22 key insight o Reduce learning task for an expression to learning tasks for its sub-expressions o Examples: Learn Map (λx : F, S) o Learn the scalar expression F o Learn the sequence expression S

23 instantiations o Text files o Web pages o Spreadsheets

24 demo

25 evaluation o Can FlashExtract extract data from real-world files? o How many interactions typically required? o How efficient/real-time is FlashExtract?

26 expressiveness o Can FlashExtract extract data from real-world files? o How many interactions typically required? o How efficient/real-time is FlashExtract?

27 benchmarks o 25 text files o System log files o Copied texts from web pages and PDFs o Samples from “Pro Perl Parsing” o 25 webpages from [1] o Add two more test cases for each web page o 25 spreadsheets o 7 from [2] that are applicable for extracting o 18 from EUSES corpus [1] E. Oro, M. Ruffolo, and S. Staab. Sxpath: extending xpath towards spatial querying on web documents. Proc. VLDB Endow., [2] B. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.

28 effectiveness o Can FlashExtract extract data from real-world files? Yes o How many interactions typically required? 2.36 examples o How efficient/real-time is FlashExtract?

29 efficiency o Can FlashExtract extract data from real-world files? Yes o How many interactions typically required? 2.36 examples o How efficient/real-time is FlashExtract? 0.82s last interaction

30 conclusion o Inductive meta-synthesis o FlashExtract is general o Text file, web page, spreadsheet instantiations o FlashExtract is practical o Extract real-world data, in real time, within a few examples

31 thank you Questions?


Download ppt "FlashExtract : A General Framework for Data Extraction by Examples Vu Le (UC Davis) Sumit Gulwani (MSR)"

Similar presentations


Ads by Google