Learning Semantic String Transformations from Examples
Rishabh Singh and Sumit Gulwani
FlashFill
Transformations
  Syntactic Transformations
  –Concatenation of regular-expression-based substrings
  –“VLDB2012” → “VLDB”
  Semantic Transformations
  –More than just characters
  –“1/5/2010” → “May 1st 2010”
Semantic Transformations
  Semantic information as relational tables
  –1 January, 2 February
  Learn table lookup queries
  –VLOOKUP macro 2nd most problematic
Outline
  Lookup Transformations
  Lookup + Syntactic Transformations
  Case Studies
Table Lookup Transformations Demo
Learning Framework
  Input Strings → F → Output String (F1, …, Fn)
  1. Domain-specific Language L
  2. Algorithm to learn all Fs from (i,o)
Example - Lookup
  EmpRecord (columns: SSN, EmpId, Name)
  Name: John Henry, William Johnson, Steve Russell, Ian Jordan, Mary Dina
  Input: v1
  Output: Steve Russell
  Select(Name, EmpRecord, (SSN = v1))
Example – Transitive Lookup
  ItemRec
    ItemId   Item
    ST-340   Stroller
    BI-567   Bib
    DI-328   Diapers
    WI-989   Wipes
    AS-469   Aspirator
  PriceRec
    ItemId   Price
    ST-340   $
    BI-567   $3.56
    DI-328   $21.45
    WI-989   $5.12
    AS-469   $2.56
  Input v1: Stroller
  Output: $
  Select(Price, PriceRec, ItemId = Select(ItemId, ItemRec, Item = v1))
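The nested Select above can be read as two chained key lookups. The following is a minimal sketch of that reading in Python, not the authors' implementation: the `select` helper, the list-of-dicts table encoding, and the `price_of` name are assumptions made for illustration. The Stroller price is truncated in the source, so that row is left out of `PRICE_REC` rather than guessed.

```python
from typing import Dict, List

Row = Dict[str, str]
Table = List[Row]

# Rows copied from the slide.
ITEM_REC: Table = [
    {"ItemId": "ST-340", "Item": "Stroller"},
    {"ItemId": "BI-567", "Item": "Bib"},
    {"ItemId": "DI-328", "Item": "Diapers"},
    {"ItemId": "WI-989", "Item": "Wipes"},
    {"ItemId": "AS-469", "Item": "Aspirator"},
]
# The ST-340 price is truncated in the source, so it is omitted here.
PRICE_REC: Table = [
    {"ItemId": "BI-567", "Price": "$3.56"},
    {"ItemId": "DI-328", "Price": "$21.45"},
    {"ItemId": "WI-989", "Price": "$5.12"},
    {"ItemId": "AS-469", "Price": "$2.56"},
]


def select(column: str, table: Table, key_column: str, key: str) -> str:
    """Return `column` from the single row whose `key_column` equals `key`."""
    matches = [row[column] for row in table if row[key_column] == key]
    if len(matches) != 1:
        raise LookupError(f"{key_column} = {key!r} matches {len(matches)} rows")
    return matches[0]


def price_of(v1: str) -> str:
    """Select(Price, PriceRec, ItemId = Select(ItemId, ItemRec, Item = v1))."""
    item_id = select("ItemId", ITEM_REC, "Item", v1)
    return select("Price", PRICE_REC, "ItemId", item_id)


print(price_of("Bib"))
```

The inner lookup resolves the item name to its ItemId, and the outer lookup uses that ItemId as the key into PriceRec.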
Learn Query
  Tables: ItemRec (ItemId, Item), PriceRec (ItemId, Price)
  Input v1: Stroller
  Output: $
  Select(Price, PriceRec, ItemId = Select(ItemId, ItemRec, Item = v1))
Strings reachable from input row
  EmpRecord (columns: SSN, EmpId, Name)
  Name: John Henry, William Johnson, Steve Russell, Ian Jordan
Strings in table rows of visited nodes: Steve Russell
Repeat until k steps or fixpoint
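The exploration described on these slides can be sketched as a bounded fixpoint loop. This is a simplified illustration, not the authors' algorithm: it tracks only which strings become reachable, not the lookup expressions that produce them, and it treats any shared cell value as a link between rows. The function name `reachable_strings` and the two-row tables are assumptions for the example.

```python
from typing import Dict, List, Set

Table = List[Dict[str, str]]

# Two rows taken from the ItemRec / PriceRec example.
TABLES: Dict[str, Table] = {
    "ItemRec": [{"ItemId": "BI-567", "Item": "Bib"}],
    "PriceRec": [{"ItemId": "BI-567", "Price": "$3.56"}],
}


def reachable_strings(inputs: List[str], tables: Dict[str, Table], k: int) -> Set[str]:
    """Strings reachable from the input row in at most k lookup steps."""
    reached: Set[str] = set(inputs)
    for _ in range(k):
        new: Set[str] = set()
        for rows in tables.values():
            for row in rows:
                cells = set(row.values())
                # A row is visited when one of its cells is already reachable.
                if cells & reached:
                    new |= cells - reached
        if not new:  # fixpoint: another step would add nothing
            break
        reached |= new
    return reached


print(sorted(reachable_strings(["Bib"], TABLES, k=1)))
print(sorted(reachable_strings(["Bib"], TABLES, k=2)))
```

With one step only the ItemId becomes reachable from "Bib"; the price needs a second step, which corresponds to the transitive lookup.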
Maintains tree structure
–share common sub-expressions
CNF of Boolean Conditionals
–independent column predicates
Synthesize Procedure
  Synthesize((i_1,o_1), …, (i_n,o_n))
    P = GenerateStr_t(i_1,o_1)
    for j = 2 to n:
      P’ = GenerateStr_t(i_j,o_j)
      P = Intersect_t(P’, P)
    return P
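The procedure above can be sketched in Python under a strong simplification: the set P is held as an explicit set of expression tuples, so Intersect_t is plain set intersection. The slides describe a succinct shared representation instead, which this sketch does not attempt. The expression encoding, the depth bound `k`, and the requirement that a key value match exactly one row are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Table = List[Dict[str, str]]
# ("v1",) or ("select", column, table_name, key_column, sub_expression)
Expr = Tuple

TABLES: Dict[str, Table] = {
    "ItemRec": [
        {"ItemId": "BI-567", "Item": "Bib"},
        {"ItemId": "WI-989", "Item": "Wipes"},
    ],
    "PriceRec": [
        {"ItemId": "BI-567", "Price": "$3.56"},
        {"ItemId": "WI-989", "Price": "$5.12"},
    ],
}


def generate_str(inp: str, out: str, tables: Dict[str, Table], k: int = 2) -> Set[Expr]:
    """All lookup expressions of depth <= k that map `inp` to `out`."""
    layer: Dict[str, Set[Expr]] = {inp: {("v1",)}}
    found: Set[Expr] = set(layer.get(out, set()))
    for _ in range(k):
        nxt: Dict[str, Set[Expr]] = {}
        for name, rows in tables.items():
            for row in rows:
                for key_col, key_val in row.items():
                    # Only index on values that identify exactly one row.
                    if sum(r[key_col] == key_val for r in rows) != 1:
                        continue
                    for sub in layer.get(key_val, ()):
                        for col, val in row.items():
                            if col != key_col:
                                expr = ("select", col, name, key_col, sub)
                                nxt.setdefault(val, set()).add(expr)
        found |= nxt.get(out, set())
        layer = nxt
    return found


def synthesize(examples: List[Tuple[str, str]], tables: Dict[str, Table], k: int = 2) -> Set[Expr]:
    """P = GenerateStr(i_1, o_1); then intersect with each further example."""
    (i1, o1), rest = examples[0], examples[1:]
    programs = generate_str(i1, o1, tables, k)
    for i_j, o_j in rest:
        programs = generate_str(i_j, o_j, tables, k) & programs
    return programs


for program in synthesize([("Bib", "$3.56"), ("Wipes", "$5.12")], TABLES):
    print(program)
```

Explicit enumeration grows quickly with the number of tables and the depth bound, which is the motivation for sharing sub-expressions rather than listing every expression.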
Semantic String Transformations Demo
[Gulwani POPL11]
Syntactic manipulations over lookup outputs
Syntactic manipulations before indexing
SSN:
  EmpRecord (columns: SSN, EmpId, Name)
  Name: John Henry, William Johnson, Steve Russell, Ian Jordan
  Output: Mr. Steve Russell
Set of reachable strings: { “SSN: ”, “ ”, “1125”, “Steve Russell” }
Output: Mr. Steve Russell
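One way to picture how the output is explained from the reachable set is to look for reachable strings that occur inside the output and treat the remainder as constants. The sketch below is a deliberately narrow assumption: it uses whole reachable strings only and at most one lookup per program, whereas the slides combine lookups with the syntactic substring language of [Gulwani POPL11]. The names `explain_output`, `Const` and `Lookup` are hypothetical, and the blank entry of the reachable set is left out because its value is not recoverable from the source.

```python
from typing import List, Set, Tuple

Part = Tuple[str, str]  # ("Const", text) or ("Lookup", reachable_string)


def explain_output(output: str, reachable: Set[str]) -> List[List[Part]]:
    """Programs of the form Const? + Lookup + Const? that produce `output`."""
    programs: List[List[Part]] = []
    for s in sorted(reachable):
        start = output.find(s) if s else -1
        while start != -1:
            prefix, suffix = output[:start], output[start + len(s):]
            parts: List[Part] = []
            if prefix:
                parts.append(("Const", prefix))
            parts.append(("Lookup", s))
            if suffix:
                parts.append(("Const", suffix))
            programs.append(parts)
            start = output.find(s, start + 1)
    return programs


# Reachable strings from the slide, without the blank entry.
reachable = {"SSN: ", "1125", "Steve Russell"}
print(explain_output("Mr. Steve Russell", reachable))
```

For this input the only explanation found is the constant "Mr. " followed by the looked-up name.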
Experiments
Related Work
  Matching strings for table joins
  –Record Matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06]
  –Schema Matching [Dhamankar et al. SIGMOD04, Warren & Tompa VLDB06]
  Query Synthesis
  –from representative view [Das Sarma et al. ICDT10, Tran et al. SIGMOD09]
  Text-editing by example
  –QuickCode [Gulwani POPL11]
  –SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]
Thanks!
  End-Users, Algorithm Designers, Software Developers
  Large potential
Backup slides
Semantic String Transformations
  Time (24 Hr)   Time (12 Hr)
  0930           9:30 AM
  1520           3:20 PM
  =TEXT(C,"00 00")+0
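The two rows above can be reproduced by combining a table lookup with simple substring and concatenation steps. The sketch below is one plausible reading, not the program the tool learns: the `HOUR_TABLE` relation, its column names, and the `to_12_hour` helper are all hypothetical.

```python
from typing import Dict, List

# Hypothetical lookup table: one row per hour of the day.
HOUR_TABLE: List[Dict[str, str]] = [
    {
        "Hour24": f"{h:02d}",
        "Hour12": str(h % 12 or 12),
        "Period": "AM" if h < 12 else "PM",
    }
    for h in range(24)
]


def select(column: str, table: List[Dict[str, str]], key_column: str, key: str) -> str:
    """Return `column` from the single row whose `key_column` equals `key`."""
    matches = [row[column] for row in table if row[key_column] == key]
    if len(matches) != 1:
        raise LookupError(f"{key_column} = {key!r} matches {len(matches)} rows")
    return matches[0]


def to_12_hour(v1: str) -> str:
    """Substring before indexing, then concatenate the lookup outputs."""
    hour24, minutes = v1[:2], v1[2:]
    hour12 = select("Hour12", HOUR_TABLE, "Hour24", hour24)
    period = select("Period", HOUR_TABLE, "Hour24", hour24)
    return hour12 + ":" + minutes + " " + period


print(to_12_hour("0930"))  # 9:30 AM
print(to_12_hour("1520"))  # 3:20 PM
```

The hour is extracted syntactically, mapped through the table, and the minutes are carried over unchanged.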
Semantic String Transformations
  Date   Formatted Date
  Jun 3rd,
Idea 1: Share sub-expressions
  T1   C1  C2  C3
       s1  s2  s3
  T2   C1  C2  C3
       s2  s3  s4
  T3   C1  C2  C3
       s3  s4  s5
  Select(C3, T2, C1 = e)
  Select(C2, T3, C1 = Select(C2, T2, C1 = e))
Youtube Videos French Polish Urdu German Serbian Russian
Idea 2: CNF conditionals
  T    C1  C2  C3  …  Cn  C_{n+1}
       s   s   s   …  s   t
  v1   v2  …   vm   Out
  s    s   …   s    t
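The table above suggests why conditionals are kept in CNF with independent column predicates: when every input v1, …, vm matches every column C1, …, Cn, the number of distinct conjunctions grows exponentially, while the per-column option lists stay small. The sketch below counts both, assuming any non-empty subset of columns may be constrained; the actual language may restrict which column sets are allowed, so the numbers illustrate the growth rather than the exact count. All function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple


def cnf_representation(cells: Dict[str, str], inputs: Dict[str, str]) -> Dict[str, List[str]]:
    """For each column, the input variables whose value equals the cell."""
    return {
        col: [name for name, value in inputs.items() if value == cell]
        for col, cell in cells.items()
    }


def count_conditionals(cnf: Dict[str, List[str]]) -> int:
    """Number of non-empty conjunctions, computed without enumerating them."""
    total = 1
    for options in cnf.values():
        total *= len(options) + 1  # +1: leave this column unconstrained
    return total - 1  # exclude the empty conjunction


def enumerate_conditionals(cnf: Dict[str, List[str]]) -> Iterator[List[Tuple[str, str]]]:
    """Explicit enumeration, feasible only for small n and m."""
    cols = list(cnf)
    for choice in product(*[[None] + cnf[c] for c in cols]):
        cond = [(c, v) for c, v in zip(cols, choice) if v is not None]
        if cond:
            yield cond


n, m = 3, 2
cells = {f"C{i}": "s" for i in range(1, n + 1)}
inputs = {f"v{j}": "s" for j in range(1, m + 1)}
cnf = cnf_representation(cells, inputs)
stored = sum(len(options) for options in cnf.values())
print(stored, count_conditionals(cnf))
```

Under these assumptions the stored size is n·m predicates while the represented set has (m+1)^n − 1 conjunctions.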
No. of Consistent Expressions
Succinct Representation
Performance
Ranking
Idea 2: CNF conditionals
Related Work
  Record Matching
  –Similarity functions for matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06]
  –Customizable similarity function [Arasu et al. VLDB09]
  Learning Schema Matches
  –iMAP [Dhamankar et al. SIGMOD04]: concatenation of column strings using domain-specific knowledge
  –[Warren & Tompa VLDB06]: concatenation of column substrings, single table
Related Work
  Query Synthesis [Das Sarma et al. ICDT10, Tran et al. SIGMOD09]
  –Infer relation from large representative example view
  –no joins or projections
  Text-editing using examples
  –QuickCode [Gulwani POPL11]: string transformations
  –SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]: programming by demonstration
General Framework
  A Domain-specific Transformation Language L
  –Expressive and succinct
  Efficient data structures for sets of expressions
  –Version-space algebra
  GenerateStr
  –Set of all expressions consistent with an I-O example
  Intersect
  –Intersect two sets of expressions
Example – Transitive Lookups
  ItemRec
    ItemId   Item
    ST-340   Stroller
    BI-567   Bib
    DI-328   Diapers
    WI-989   Wipes
    AS-469   Aspirator
  PriceRec
    ItemId   Price
    ST-340   $
    BI-567   $3.56
    DI-328   $21.45
    WI-989   $5.12
    AS-469   $2.56
  Input v1    Output
  Stroller    $
  Bib
  Aspirator
  Wipes
  Select(Price, PriceRec, ItemId = Select(ItemId, ItemRec, Item = v1))
Example
  T1     C1   C2       C3
         s1   s2       s3
  T2     C1   C2       C3
         s2   s3       s4
  …
  T_i    C1   C2       C3
         s_i  s_{i+1}  s_{i+2}
  …
  Tm
  Input v1: s1
  Output: sm
Sub-expression Sharing
  T_{i-2}   C1       C2       C3
            s_{i-2}  s_{i-1}  s_i
  T_{i-1}   C1       C2       C3
            s_{i-1}  s_i      s_{i+1}
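The effect of sharing can be illustrated on a chain of tables in which T_i holds the row (s_i, s_{i+1}, s_{i+2}) and the input v1 equals s1. The sketch below assumes, for simplicity, that lookups index only on column C1, so s_j is produced either from T_{j-1} (column C2) or from T_{j-2} (column C3); the real language allows more predicates. Under that assumption the number of distinct expressions follows the Fibonacci recurrence, while a shared structure needs one node per string. The function names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


@lru_cache(maxsize=None)
def count_expressions(j: int) -> int:
    """Number of distinct lookup expressions that evaluate to s_j."""
    if j == 1:
        return 1  # the input variable v1 itself
    total = count_expressions(j - 1)  # Select(C2, T_{j-1}, C1 = <expr for s_{j-1}>)
    if j >= 3:
        total += count_expressions(j - 2)  # Select(C3, T_{j-2}, C1 = <expr for s_{j-2}>)
    return total


def shared_dag(m: int) -> Dict[int, List[tuple]]:
    """One node per string s_j; each option refers to a child node by index."""
    dag: Dict[int, List[tuple]] = {1: [("v1",)]}
    for j in range(2, m + 1):
        options = [("Select", "C2", f"T{j - 1}", "C1", j - 1)]
        if j >= 3:
            options.append(("Select", "C3", f"T{j - 2}", "C1", j - 2))
        dag[j] = options
    return dag


m = 30
dag_size = sum(len(options) for options in shared_dag(m).values())
print(count_expressions(m), dag_size)
```

For m = 30 the explicit set has 832040 expressions, while the shared structure stores 58 options, because each sub-expression for s_{j-1} and s_{j-2} is referenced rather than copied.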
Current State of the Art: Help forums
Observations
  Semantic string transformations
  Input-output examples based interaction
  –New disambiguating inputs
  Add-in with the same interface