FlashNormalize: Programming by Examples for Text Normalization. International Joint Conference on Artificial Intelligence, Buenos Aires, 7/29/2015.

Presentation transcript:

FlashNormalize: Programming by Examples for Text Normalization
International Joint Conference on Artificial Intelligence, Buenos Aires
Dileep Kini, Sumit Gulwani

What is Text Normalization?
Real text contains non-standard words (NSWs): numbers, dates, currencies, phone numbers, etc. [Sproat, 2010]
Normalization = converting NSWs into contextually appropriate and consistently formatted variants.
Applications like text-to-speech, machine translation, and speech-recognition training require normalization of such words.

Typical Tasks

Number Translations
  Input    English                                     French
  1234     One thousand two hundred and thirty four    Mille deux cent trente-quatre
  850      Eight hundred and fifty                     Huit cent cinquante
  79000    Seventy nine thousand                       Soixante-dix-neuf mille

Dates
  Input           Output                                Input Variation
  Jan 08, 2065    January eighth twenty sixty five      08/01/
  Apr 23, 2006    April twenty third two thousand six   /04/
  Aug 10, 1900    August tenth nineteen hundred         /08/1900

Challenges
Traditional method: manual programming
  Scalability: large number of domain/format/language combinations
  Requires pairing of a programmer and a language expert
Recent techniques: statistical methods
  Require a large number of examples
  Obtained transformation is not 100% accurate
Our approach in FlashNormalize: Programming-by-Examples
  Fewer examples
  100% accurate
  Cannot handle noise in the data

Problem Formulation
Consider functions that take an input string and produce a sequence of strings.
For dates we need a function that transforms the input string “Jan 08, 2065” into the sequence: January eighth twenty sixty five.
The specification provided by the user is a set of input-output pairs.
The goal is to learn a function that is consistent with all the given examples.
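The consistency requirement can be made concrete with a small sketch. The names `Example`, `Program`, `is_consistent`, and `constant_candidate` below are illustrative assumptions, not identifiers from FlashNormalize; the two examples are taken from the Dates table in these slides.

```python
from typing import Callable, List, Tuple

# An example pairs an input string with the expected sequence of output strings.
Example = Tuple[str, List[str]]
# A candidate program maps an input string to a sequence of strings.
Program = Callable[[str], List[str]]


def is_consistent(program: Program, examples: List[Example]) -> bool:
    """Return True when the program reproduces every given output exactly."""
    return all(program(inp) == out for inp, out in examples)


examples: List[Example] = [
    ("Jan 08, 2065", ["January", "eighth", "twenty", "sixty", "five"]),
    ("Apr 23, 2006", ["April", "twenty", "third", "two", "thousand", "six"]),
]


def constant_candidate(v: str) -> List[str]:
    # Hypothetical candidate that ignores its input; it only fits the first example.
    return ["January", "eighth", "twenty", "sixty", "five"]


fits_first = is_consistent(constant_candidate, examples[:1])
fits_all = is_consistent(constant_candidate, examples)
```

A constant program satisfies one example but fails as soon as a second one is added, which is why the learner must search a space of programs that actually read the input.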

Solution Overview
A Programming-by-Examples technology:
  Domain Specific Language: the space of possible programs (Concept Class)
  Learning Algorithm

Domain Specific Language (DSL)
Description of the space of possible programs
[Diagram: … Predicate, Concat Expr; example expressions: Month(Split(v,0)), Ordinal(Trim(Dig(v,0))), “thousand”]
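To illustrate how a program in such a DSL might be evaluated, here is a minimal sketch of a decision list whose branches are concats of process expressions. The slides do not define the operators, so everything here is an assumption: `split` is taken to return the k-th alphanumeric token, `dig` the k-th digit run, `trim` to drop leading zeros, and `MONTH` and `ORDINAL` are tiny hypothetical lookup tables. The sketch covers only the month and day parts of a date.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical lookup tables; a real system would need complete tables.
MONTH = {"Jan": "January", "Apr": "April", "Aug": "August"}
ORDINAL = {"8": "eighth", "10": "tenth", "23": "twenty third"}


def split(v: str, k: int) -> str:
    """Assumed semantics: k-th alphanumeric token of the input."""
    return re.findall(r"[A-Za-z0-9]+", v)[k]


def dig(v: str, k: int) -> str:
    """Assumed semantics: k-th maximal run of digits in the input."""
    return re.findall(r"\d+", v)[k]


def trim(s: str) -> str:
    """Assumed semantics: remove leading zeros."""
    return s.lstrip("0") or "0"


Expr = Callable[[str], str]
Branch = Tuple[Callable[[str], bool], List[Expr]]


def run(program: List[Branch], v: str) -> List[str]:
    """Evaluate the first branch whose predicate accepts the input."""
    for predicate, concat in program:
        if predicate(v):
            return [expr(v) for expr in concat]
    raise ValueError(f"no branch of the decision list matches {v!r}")


program: List[Branch] = [
    (
        lambda v: re.fullmatch(r"[A-Za-z]{3} \d{2}, \d{4}", v) is not None,
        [
            lambda v: MONTH[split(v, 0)],          # Month(Split(v,0))
            lambda v: ORDINAL[trim(dig(v, 0))],    # Ordinal(Trim(Dig(v,0)))
        ],
    ),
]

first = run(program, "Jan 08, 2065")
second = run(program, "Apr 23, 2006")
```

Additional branches with different predicates would handle other input formats, such as the slash-separated variation shown on the Typical Tasks slide.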

Synthesis Algorithm
Given a set of input-output example pairs, derive a program from the DSL that is consistent with all the examples.
Our algorithm has 2 logically distinct phases:
  A bottom-up learning of process expressions for individual examples
  A top-down search for decision lists and concats for all examples

Learning Decision Lists

Learning Concat Expressions

Learning Process Expressions
Process exprs are described using a non-recursive grammar:
  string S := B | Substr(B,k,k);
  string B := v | Split(v,k) | Dig(v,k);
  int k := -10 | -9 | … | 10;
We use the Version-Space-Algebra [Lau et al. 2000] to represent sets of programs associated with a non-terminal:
  bucket programs together that behave similarly on the given input
  use a bottom-up approach to symbolically enumerate these buckets
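The bucketing idea can be sketched with plain explicit enumeration over the grammar above. This is not a version-space algebra, which would share structure symbolically instead of listing every program; it only shows programs being grouped by their output on one input. The operator semantics are assumptions: `split` returns the k-th alphanumeric token, `dig` the k-th digit run, negative k counts from the end, and `Substr(B,i,j)` behaves like a Python slice.

```python
import re
from collections import defaultdict
from typing import Dict, List

K = range(-10, 11)  # int k := -10 | -9 | ... | 10


def split(v: str, k: int) -> str:
    """Assumed semantics: k-th alphanumeric token (negative k counts from the end)."""
    return re.findall(r"[A-Za-z0-9]+", v)[k]


def dig(v: str, k: int) -> str:
    """Assumed semantics: k-th maximal digit run (negative k counts from the end)."""
    return re.findall(r"\d+", v)[k]


def enumerate_buckets(v: str) -> Dict[str, List[str]]:
    """Group every expression derivable from S by the string it yields on v."""
    # Level 1: B := v | Split(v,k) | Dig(v,k)
    base = {"v": v}
    for k in K:
        for name, fn in (("Split", split), ("Dig", dig)):
            try:
                base[f"{name}(v,{k})"] = fn(v, k)
            except IndexError:
                pass  # expression is undefined on this input

    buckets: Dict[str, List[str]] = defaultdict(list)
    # S := B
    for expr, value in base.items():
        buckets[value].append(expr)
    # S := Substr(B,k,k), built bottom-up from the level-1 values
    for expr, value in base.items():
        for i in K:
            for j in K:
                piece = value[i:j]
                if piece:
                    buckets[piece].append(f"Substr({expr},{i},{j})")
    return buckets


buckets = enumerate_buckets("Jan 08, 2065")
```

Each key of `buckets` is an observed output and each value lists the expressions that produce it, so all programs in a bucket are indistinguishable on this input.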

Synthesis Strategies
Our learning algorithm requires:
  1. A set of representative examples
  2. Descriptions of the tables used in process expressions
Determining either or both can be challenging!
Modularity: separation of a program into smaller ones that can be reused
  When the program to be learnt is potentially huge, we try learning programs that handle certain parts of the output and use them to learn a complete program.
Active Learning: for assisting the user in finding the right examples, and for synthesizing tables
  Domain knowledge is encoded in the form of an algorithm that suggests inputs on which the hypothesis program might be wrong.
  Queries: a) Membership b) Equivalence c) Test

Evaluation
Languages: Russian, Polish, French, Chinese, German, Portuguese, Spanish, English, Italian
Columns reported per language: T, M, E, Tm, Dl
  T: # test queries
  M: # membership queries
  E: # examples used in synthesis
  Tm: time taken in seconds
  Dl: length of the decision list

Thank You!

Extras
String -> Boolean
Parse Expr: functions that extract a substring of the input, described by a grammar
String -> String
Synthesis Algorithm
  Input: set of examples E
  Output: a program in the DSL consistent with E

Bottom-up learning of process expressions:
Process expressions are described using a grammar.
We perform a symbolic bottom-up enumeration [Menon et al., 13] of the programs using Version Space Algebra [Lau et al., 00].

Learning MCC for concat expressions:
Substrings of the output, annotated with the process exprs that explain each substring, give rise to a DAG representation of all concats that produce the output for that input.
Parallel DFS across all DAGs to obtain subsets explained by common concats.
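A rough sketch of the DAG idea follows, under several assumptions. Nodes are positions in the output sequence, and an edge from i to j is labelled with the expressions whose value equals the space-joined output pieces between i and j. Instead of the parallel DFS mentioned on the slide, the sketch naively enumerates every path in each DAG and intersects the results. The `explained_*` maps are hand-written, hypothetical stand-ins for what the process-expression learner would return.

```python
from typing import Dict, List, Set, Tuple

Edges = Dict[Tuple[int, int], List[str]]


def build_dag(output: List[str], explained: Dict[str, List[str]]) -> Edges:
    """Label each span (i, j) of the output with the expressions that produce it."""
    edges: Edges = {}
    n = len(output)
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = " ".join(output[i:j])
            if piece in explained:
                edges[(i, j)] = explained[piece]
    return edges


def all_concats(edges: Edges, n: int, start: int = 0) -> List[List[str]]:
    """Enumerate every path from `start` to n; each path is one concat."""
    if start == n:
        return [[]]
    result: List[List[str]] = []
    for (i, j), exprs in edges.items():
        if i == start:
            for expr in exprs:
                for rest in all_concats(edges, n, j):
                    result.append([expr] + rest)
    return result


# Hypothetical explanations for "Jan 08, 2065" -> January eighth
out_1 = ["January", "eighth"]
explained_1 = {
    "January": ["Month(Split(v,0))"],
    "eighth": ["Ordinal(Trim(Dig(v,0)))", "Ordinal(Substr(Dig(v,0),1,2))"],
}
# Hypothetical explanations for "Apr 23, 2006" -> April, twenty third
out_2 = ["April", "twenty third"]
explained_2 = {
    "April": ["Month(Split(v,0))"],
    "twenty third": ["Ordinal(Trim(Dig(v,0)))"],
}

concats_1 = all_concats(build_dag(out_1, explained_1), len(out_1))
concats_2 = all_concats(build_dag(out_2, explained_2), len(out_2))

# Concats that explain both examples (naive intersection, not a parallel DFS).
common: Set[Tuple[str, ...]] = {tuple(c) for c in concats_1} & {tuple(c) for c in concats_2}
```

Here the `Substr` variant explains only the first example ("08" becomes "8", but "23" would become "3"), so the intersection keeps the single concat that works for both inputs.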