Data Manipulation using Programming by Examples and Natural Language Invited Upenn April 2015 Sumit Gulwani.

1 Data Manipulation using Programming by Examples and Natural Language. Invited Talk @ UPenn, April 2015. Sumit Gulwani

2 The New Opportunity
End users (non-programmers with access to computers):
–Two orders of magnitude more numerous than software developers
–Struggle with simple repetitive tasks
–Need domain-specific expert systems
Software developers:
–Traditional customer for PL technology

3 Excel help forums

4 Typical help-forum interaction
300_w5_aniSh_c1_b → w5
=MID(B1,5,2)
300_w30_aniSh_c1_b → w30
=MID(B1,FIND("_",$B:$B)+1, FIND("_",REPLACE($B:$B,1,FIND("_",$B:$B),""))-1)
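For comparison, the intent behind the two forum examples (take the text between the first and second underscore) can be written as a short Python sketch. The function name `second_field` and the choice to raise an error on malformed input are assumptions made for illustration; they do not come from the talk.

```python
def second_field(cell: str, sep: str = "_") -> str:
    """Return the text between the first and second separator in `cell`."""
    first = cell.find(sep)
    if first == -1:
        raise ValueError(f"no {sep!r} in {cell!r}")
    second = cell.find(sep, first + 1)
    if second == -1:
        raise ValueError(f"only one {sep!r} in {cell!r}")
    return cell[first + 1:second]


if __name__ == "__main__":
    print(second_field("300_w5_aniSh_c1_b"))   # w5
    print(second_field("300_w30_aniSh_c1_b"))  # w30
```

The fixed-offset formula =MID(B1,5,2) agrees with this function on the first example only, which is why the second example forces the longer formula.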

5 Flash Fill (Excel 2013 feature) demo

6 Data Manipulation
Data is locked up in silos in various formats.
–Great flexibility in organizing (hierarchical) data for viewing, but challenging to manipulate and reason about the data.
A typical workflow might involve one or more of the following steps:
–Extraction
–Transformation
–Querying
–Formatting
PBE (programming by examples) and PBNL (programming by natural language) can enable delightful data wrangling.

7 To get Started! Data Science Class Assignment

8 FlashExtract

9

10 FlashExtract Demo

11 Architecture (diagram): Intent (Inductive Spec) → Search Algorithm → Program

12 Inductive Specification
Examples: Conjunction of (input state, output state) pairs.
An inductive spec generalizes examples in 2 ways.
Generalization 1: Conjunction of (input state, output property) pairs.
Motivation: Output properties make it easier to specify intent.

13 Output properties
–Subsequence of the output list
–Elements not belonging to the output list
–Contiguous subsequence of the output list
–Prefix of the output list
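The four output properties can be read as predicates that a candidate program's output list must satisfy. The sketch below shows one way to check them in Python; the function names are hypothetical and the talk does not prescribe this encoding.

```python
from typing import List, Sequence


def is_prefix(spec: Sequence[str], out: Sequence[str]) -> bool:
    """`spec` must be the first len(spec) elements of `out`."""
    return list(out[:len(spec)]) == list(spec)


def is_subsequence(spec: Sequence[str], out: Sequence[str]) -> bool:
    """Elements of `spec` must appear in `out` in order, gaps allowed."""
    remaining = iter(out)
    return all(item in remaining for item in spec)


def is_contiguous_subsequence(spec: Sequence[str], out: Sequence[str]) -> bool:
    """`spec` must appear in `out` as one unbroken run."""
    n = len(spec)
    return any(list(out[i:i + n]) == list(spec) for i in range(len(out) - n + 1))


def excludes(negatives: Sequence[str], out: Sequence[str]) -> bool:
    """None of the negative examples may appear in `out`."""
    return not any(item in out for item in negatives)


if __name__ == "__main__":
    output: List[str] = ["a", "b", "c", "d"]
    print(is_prefix(["a", "b"], output))
    print(is_subsequence(["a", "c"], output))
    print(is_contiguous_subsequence(["b", "c"], output))
    print(excludes(["x"], output))
```

Each predicate is weaker than giving the full output list, which is the point of the generalization: a user can highlight a few items without enumerating everything.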

14 Output properties
–Prefix of the output table (sequence of records)
We do not require explicit (magenta) record boundaries, in which case the spec is:
–Prefixes of projections of the output table

15 Inductive Specification
Examples: Conjunction of (input state, output state) pairs.
An inductive spec generalizes examples in 2 ways.
Generalization 1: Conjunction of (input state, output property) pairs.
Motivation: Output properties make it easier to specify intent.
Generalization 2: Boolean combination of (input state, output property) pairs.
Motivation: Arises internally as part of specification refinement.

16 Architecture (diagram): Intent (Inductive Spec) and DSL → Search Algorithm → Program
Challenge 1: Designing an efficient search algorithm.

17 DSL for Substring Extraction
Consider the tasks:
1. [String s -> Substring] (arises in FlashFill)
2. [Long String s -> List of Substrings] (arises in FlashExtract)
A regular expression suffices for both, but is not ideal:
–Difficult to synthesize
–Difficult to explain to the user
We propose abstractions that involve simpler regexes.

18 DSL for Substring Extraction
Consider the tasks:
1. [String s -> Substring] (arises in FlashFill)
2. [Long String s -> List of Substrings] (arises in FlashExtract)
DSL for Task 1, i.e., [String s -> Substring] :=
  let p1 = [s -> index] in
  let p2 = [s -> index] in SubStr(s, p1, p2)
DSL for [String s -> index] :=
  Constant
  | Pos(s, regex1, regex2, k) // k-th position in s whose left/right side matches regex1/regex2
  | let t = Suffix(s,p1) in [t -> index]

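19 The SubStr Operator
Let w = SubStr(s, p, p') where p = Pos(s, r1, r2, k) and p' = Pos(s, r1', r2', k').
(Diagram: string s with positions p and p' delimiting w; w1 and w2 are the substrings immediately to the left and right of p, and w1' and w2' are those around p'.)
r1 matches w1, r2 matches w2, r1' matches w1', r2' matches w2'.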
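The Pos and SubStr operators can be sketched in a few lines of Python using the standard `re` module. This is a minimal illustration under stated assumptions, not the FlashFill implementation: k is 1-based and positive only, regexes are ordinary Python regexes, and the names `pos`, `substr` and `second_token` are hypothetical.

```python
import re


def pos(s: str, r1: str, r2: str, k: int) -> int:
    """Return the k-th index i in s such that some match of r1 ends at i
    and some match of r2 starts at i (k is 1-based; an assumption)."""
    if k < 1:
        raise ValueError("k must be >= 1 in this sketch")
    hits = [
        i for i in range(len(s) + 1)
        if re.search("(?:" + r1 + r")\Z", s[:i]) and re.match(r2, s[i:])
    ]
    if k > len(hits):
        raise ValueError(f"only {len(hits)} matching positions in {s!r}")
    return hits[k - 1]


def substr(s: str, p1: int, p2: int) -> str:
    """SubStr(s, p1, p2): the text between the two positions."""
    return s[p1:p2]


def second_token(s: str) -> str:
    """w = SubStr(s, Pos(s, "_", "", 1), Pos(s, "", "_", 2))."""
    p = pos(s, "_", "", 1)   # first position with "_" on its left
    q = pos(s, "", "_", 2)   # second position with "_" on its right
    return substr(s, p, q)


if __name__ == "__main__":
    print(second_token("300_w5_aniSh_c1_b"))   # w5
    print(second_token("300_w30_aniSh_c1_b"))  # w30
```

Note that the two regexes per position are much simpler than a single regex describing the whole substring, which is the design point made on the preceding slides.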

20 DSL for Substring Extraction
Consider the tasks:
1. [String s -> Substring] (arises in FlashFill)
2. [Long String s -> List of Substrings] (arises in FlashExtract)
DSL for Task 2, i.e., [String s -> List of substrings] :=
  let L = Filter(Split(s,"\n"), [Line -> Bool]) in Map(L, [String -> Substring])
DSL for [Line t -> Bool] :=
  MatchRegex(t, regex)
  | MatchRegex(t.previous, regex)
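The Split / Filter / Map shape of the Task 2 DSL maps directly onto ordinary list operations. The sketch below is illustrative only: the helper names (`match_line`, `match_previous`, `extract_list`, `after_colon`) and the sample text are assumptions, and the per-line extractor stands in for any [String -> Substring] program.

```python
import re
from typing import Callable, List, Optional

LinePred = Callable[[str, Optional[str]], bool]


def match_line(regex: str) -> LinePred:
    """MatchRegex(t, regex): keep lines that match the regex themselves."""
    return lambda line, prev: re.search(regex, line) is not None


def match_previous(regex: str) -> LinePred:
    """MatchRegex(t.previous, regex): keep lines whose previous line matches."""
    return lambda line, prev: prev is not None and re.search(regex, prev) is not None


def extract_list(s: str, keep: LinePred, extract: Callable[[str], str]) -> List[str]:
    """let L = Filter(Split(s, "\\n"), keep) in Map(L, extract)."""
    lines = s.split("\n")
    selected = [
        line for i, line in enumerate(lines)
        if keep(line, lines[i - 1] if i > 0 else None)
    ]
    return [extract(line) for line in selected]


def after_colon(line: str) -> str:
    """Stand-in [String -> Substring] program: text after the first ': '."""
    return line.split(": ", 1)[1]


if __name__ == "__main__":
    text = "Name: Alice\nAge: 30\nName: Bob\nAge: 41"
    print(extract_list(text, match_line(r"^Name:"), after_colon))      # ['Alice', 'Bob']
    print(extract_list(text, match_previous(r"^Name:"), after_colon))  # ['30', '41']
```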

21 Architecture (diagram): Intent (Inductive Spec), DSL, and Deductive Reasoning Rules for specification refinement → Search Algorithm → Program
Challenge 1: Designing an efficient search algorithm.

22 Deductive Reasoning for Specification Refinement
DSL for [String s -> List of substrings]:
  let L = Filter(Split(s,"\n"), [Line -> Bool]) in Map(L, [String -> Substring])
The spec for [String -> List of substrings] is refined into a spec for [Line -> Bool] and a spec for [String -> Substring].

23 Deductive Reasoning for Specification Refinement
DSL for [String s -> Substring] :=
  let p1 = [s -> index] in
  let p2 = [s -> index] in SubStr(s, p1, p2)
The spec for [String -> Substring] is refined into a spec for p1 and a spec for p2.
Disjunctions & conjunctions are handled using union & intersection over program sets (Version Space Algebras).
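A toy version of this refinement can be sketched with plain Python sets standing in for version spaces. Everything specific here is an assumption for illustration: the tiny regex vocabulary `TOKENS`, the tuple encoding (r1, r2, k) of a Pos program, and the function names. Each occurrence of the desired substring gives one disjunct (a pair of start-program and end-program sets); multiple examples are a conjunction and are combined by intersection.

```python
import re
from itertools import product
from typing import List, Set, Tuple

PosProgram = Tuple[str, str, int]          # (r1, r2, k), hypothetical encoding
TOKENS = ["", "_", r"\d+", "[a-z]+"]       # hypothetical regex vocabulary


def positions(s: str, r1: str, r2: str) -> List[int]:
    """All indices with a match of r1 ending there and r2 starting there."""
    return [i for i in range(len(s) + 1)
            if re.search("(?:" + r1 + r")\Z", s[:i]) and re.match(r2, s[i:])]


def programs_for(s: str, target: int) -> Set[PosProgram]:
    """Every Pos program over TOKENS that evaluates to `target` on s."""
    found: Set[PosProgram] = set()
    for r1, r2 in product(TOKENS, repeat=2):
        hits = positions(s, r1, r2)
        if target in hits:
            found.add((r1, r2, hits.index(target) + 1))
    return found


def learn_substring(examples: List[Tuple[str, str]]):
    """Return a list of (start_programs, end_programs) pairs; the list is a union."""
    space = None
    for s, w in examples:
        per_example = []
        idx = s.find(w)
        while idx != -1:  # disjunction: one disjunct per occurrence of w in s
            per_example.append((programs_for(s, idx),
                                programs_for(s, idx + len(w))))
            idx = s.find(w, idx + 1)
        if space is None:
            space = per_example
        else:             # conjunction over examples: pairwise intersection
            space = [(a1 & b1, a2 & b2)
                     for a1, a2 in space for b1, b2 in per_example]
        space = [(a, b) for a, b in space if a and b]
    return space or []


if __name__ == "__main__":
    result = learn_substring([("300_w5_aniSh_c1_b", "w5"),
                              ("300_w30_aniSh_c1_b", "w30")])
    for starts, ends in result:
        print(sorted(starts), sorted(ends))
```

With the two forum examples, the constant end position from the first example alone does not survive the intersection, while end programs anchored on the following underscore do.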

24 Architecture (diagram): Intent (Inductive Spec), DSL, Deductive Reasoning Rules for specification refinement, and Ranking Function → Search Algorithm → Program
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

25 Ranking
Synthesize multiple programs & rank them using machine learning.
General principles for ranking:
–Prefer shorter programs.
–Prefer programs with fewer constants.
Ranking strategies:
–Baseline: Pick any minimal-sized program using a minimal number of constants.
–Machine Learning: Score programs using a weighted combination of program features. Weights are learned using training data.
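Scoring by a weighted combination of features can be sketched as a linear model over per-program feature counts. The two features, the weight values and the candidate programs below are all hypothetical; in the approach described above the weights would be learned from training data rather than set by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    text: str        # printable form of the program
    size: int        # number of operators (hypothetical feature)
    constants: int   # number of constants (hypothetical feature)


# Hand-picked weights for illustration only; negative values penalize
# larger programs and programs with more constants.
WEIGHTS: Dict[str, float] = {"size": -1.0, "constants": -2.0}


def score(c: Candidate) -> float:
    """Weighted combination of program features."""
    return WEIGHTS["size"] * c.size + WEIGHTS["constants"] * c.constants


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Highest-scoring program first."""
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    candidates = [
        Candidate("SubStr(s, 4, 6)", size=1, constants=2),
        Candidate('SubStr(s, Pos(s,"_","",1), Pos(s,"","_",2))', size=3, constants=0),
    ]
    for c in rank(candidates):
        print(score(c), c.text)
```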

26 Experimental Comparison of Ranking Strategies
Strategy    Average # of examples required
Baseline    4.17
Learning    1.48
Technical Report: “Predicting a correct program in Programming by Example”, Rishabh Singh, Sumit Gulwani

27 FlashFill Ranking Demo

28 FlashMeta Architecture (diagram): Intent (Inductive Spec), DSL, Deductive Reasoning Rules for specification refinement, and Ranking Function → Search Algorithm → Program
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

29 Need for a better User Interaction Model!
“It's a great concept, but it can also lead to lots of bad data. I think many users will look at a few "flash filled" cells, and just assume that it worked. … Be very careful.”
“most of the extracted data will be fine. But there might be exceptions that you don't notice unless you examine the results very carefully.”

30 User Interaction Models for Ambiguity Resolution
Make it easy to inspect output correctness
–User can accordingly provide more examples
Show programs
–in any desired programming language; in English
–Enable effective navigation between programs
Computer-initiated interactivity (Active learning)
–Highlight less confident entries in the output.
–Ask directed questions based on distinguishing inputs.
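A distinguishing input is one on which the remaining candidate programs disagree, so asking the user about it rules out at least one candidate. A minimal sketch, assuming candidate programs are available as plain Python callables (the two lambdas below are hypothetical stand-ins for synthesized programs):

```python
from typing import Callable, List, Sequence, Tuple

Program = Callable[[str], str]


def distinguishing_inputs(
    programs: Sequence[Program], inputs: Sequence[str]
) -> List[Tuple[str, List[str]]]:
    """Return each input on which the programs disagree, with the distinct outputs."""
    flagged = []
    for x in inputs:
        outputs = {p(x) for p in programs}
        if len(outputs) > 1:
            flagged.append((x, sorted(outputs)))
    return flagged


if __name__ == "__main__":
    fixed_offsets: Program = lambda s: s[4:6]            # like =MID(B1,5,2)
    by_separator: Program = lambda s: s.split("_")[1]    # text between underscores
    rows = ["300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b", "12_w7_x"]
    for row, outs in distinguishing_inputs([fixed_offsets, by_separator], rows):
        print(row, outs)
```

Rows that are not flagged are ones where all candidates agree, so the flagged rows are also natural candidates for the "less confident" highlighting mentioned above.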

31 FlashExtract Demo (User Interaction Models)

32 PBE/PBNL tools for Data Manipulation
Extraction
–FlashExtract: Extract data from text files, web pages [PLDI 2014; PowerShell ConvertFrom-String API]
–FlashRelate: Extract data from spreadsheets [PLDI 2015]
Transformation
–Flash Fill: Excel feature for Syntactic String Transformations [POPL 2011]
–Semantic String Transformations [VLDB 2012]
–Number Transformations [CAV 2013]
Querying
–NLyze: an Excel programming-by-natural-language add-in [SIGMOD 2014]
Formatting
–Table re-formatting [PLDI 2011]
–FlashFormat: a PowerPoint add-in [AAAI 2014]

33 FlashMeta Architecture (diagram): Intent (Inductive Spec), DSL, Deductive Reasoning Rules for specification refinement, and Ranking Function → Search Algorithm → Programs
The Inductive Synthesis Problem
–Definition: Intent x DSL x Ranking function -> Top-k Programs
–Solution Strategy: Spec refinement based on deductive rules
Tech Report: “FlashMeta: A Framework for Inductive Program Synthesis”, Alex Polozov, Sumit Gulwani

34 Comparison of FlashMeta with hand-tuned implementations
                    Lines of Code (K)       Development time (months)
Project             Original   FlashMeta    Original   FlashMeta
FlashFill           12         3            9          1
FlashExtractText    7          4            8          1
FlashRelate         5          2            8          1
FlashNormalize      17         2            7          2
FlashExtractWeb     N/A        2.5          N/A        1.5
Running time of FlashMeta implementations varies between 0.5x and 3x that of the corresponding original implementation.
–Faster because of some free optimizations
–Slower because of larger feature sets & a generalized framework

35 FlashRelate + NLyze Demo

36 Other Directions
–Other application domains.
–Integration with existing programming environments.
–Multi-modal intent specification using a combination of Examples and NL.

37 SmartSynth: SmartPhone Script Synthesis using NL
MobiSys 2013: “SmartSynth: Synthesizing Smartphone Automation Scripts from Natural Languages”; Vu Le, Sumit Gulwani, Zhendong Su

38 Collaborators: Vu Le, Dan Barowy, Ted Hart, Maxim Grechkin, Alex Polozov, Dileep Kini, Rishabh Singh, Mikael Mayer, Mark Marron, Gustavo Soares, Ben Zorn

39 Data Manipulation using PBE/PBNL
Data manipulation is challenging!
–Data scientists spend 80% of their time cleaning data.
–99% of end users are non-programmers.
PBE/PBNL can enable delightful data wrangling!
Cross-disciplinary inspiration
–Theory/Logical Reasoning (Search algorithm)
–Language Design (DSL)
–Machine Learning (Ranking)
–HCI (User interaction models)

