The quest for robustness, scalability, and portability in (spoken) language applications Linguistics Methodology meets Language Reality: Bob Carpenter.

1 the quest for robustness, scalability, and portability in (spoken) language applications Linguistics Methodology meets Language Reality: Bob Carpenter SpeechWorks International

2 2 The Standard Cliché(s) Moore’s Cliché: –Exponential growth in computing power and memory will continue to open up new possibilities The Internet Cliché: –With the advent and growth of the world-wide web, an ever increasing amount of information must be managed

3 3 More Standard Clichés The Convergence Cliché: –Data, voice and video networking will be integrated over a universal network that: includes land lines and wireless; includes broadband and narrowband; likely implementation is IP (internet protocol) The Interface Cliché: –The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces –Speech will become as common as graphics

4 4 Some Comp Ling Clichés The Standard Linguist’s Cliché –But it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term. –Noam Chomsky, 1969 [essay on Quine] The Standard Engineer’s Cliché –Anytime a linguist leaves the group the recognition rate goes up. –Fred Jelinek, 1988 [address to DARPA]

5 5 The “Theoretical Abstraction” mature, monolingual, native language speaker –idealized to complete knowledge of language static, homogenous language community –all speakers learn identical grammars “competence” (vs. “performance”) –“performance” is a natural class –wetware “implementation” follows theory in divorcing “knowledge of language” from processing assumes the existence and innateness of a “language faculty”

6 6 The Explicit Methodology “Empirical” Basis is binary grammaticality judgements –“intuitive” (to a “properly” trained linguist) –innateness and the “language faculty” –appropriate for phonetics through dialogue –in practice, very little agreement at boundaries and no standard evaluations of theories vs. data Models of particular languages –by grammars that generate formal languages –low priority for transformationalists –high priority for monostratalists/computationalists

7 7 The Holy Grail of Linguistics A grammar meta-formalism in which –all and only natural language grammars (idealized as above) can be expressed –assumed to correspond to the “language faculty” Grail is sought by every major camp of linguist –Explains why all major linguistic theories look alike from any perspective outside of a linguistics department –The expedient abstractions have become an end in themselves

8 8 But, Applications Require Robustness –acoustic and linguistic variation –disfluencies and noise Scalability –from embedded devices to palmtops to clients to servers –across tasks from simple to complex –system-initiative form-filling to mixed initiative dialogue Portability –simple adaptation to new tasks and new domains –preferably automated as much as possible

9 9 The $64,000 Question How do humans handle unrestricted language so effortlessly in real time? Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue Psycholinguistics has uncovered some baselines: –lexicon (and syntax?): highly parallel –time course of processing: totally online –information integration: <= 200ms for all sources But is short on explanations

10 10 (AI) Success by Stupidity Jaime Carbonell’s Argument (ECAI, mid 1990s) Apparent “intelligence” because they’re too limited to do anything wrong: “right” answer hardcoded Typical in Computational NL Grammars –lexicon limited to demo –rules limited to common ones (eg: no heavy shift) Scaling up usually destroys this limited “success” –1,000,000s of “grammatical” readings with large grammars

11 11 My Favorite Experiments (I) Mike Tanenhaus et al. (Univ. Rochester) Head-Mounted Eye Tracking Eyes track semantic resolution ~200 ms tracking time “Pick up the yellow plate” Clearly shows that understanding is online

12 12 My Favorite Experiments (II) Garden Paths are Context Sensitive –Crain & Steedman (U. Connecticut & U. Edinburgh) –if noun is not unique in context, postmodification is much more likely than if noun picks out unique individual Garden Paths are Frequency and Agreement Sensitive –Tanenhaus et al. –The horse raced past the barn fell. (raced likely past) –The horses brought into the barn fell. (brought likely participle, and less likely activity for horses)

13 13 Stats: Explanation or Stopgap A Common View –Statistics are some kind of approximation of underlying factors requiring further explanation. Steve Abney’s Analogy (AT&T Labs) –Statistical Queueing Theory –Consider traffic flows through a toll gate on a highway. –Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc. –Statistics is more insightful [explanatory] in this case as it captures emergent generalizations –It is a reductionist error to insist on low-level account

14 14 Competence vs. Performance What is computed vs. how it is computed The what can be traditional grammatical structure All structures not computed, regardless of the how Define what probabilistically, independently of how

15 15 Algebraic vs. Statistical False Dichotomy –All statistical systems have an algebraic basis, even if trivial The Good News: –Best statistical systems have best linguistic conditioning (most “explanatory” in traditional sense) –Statistical estimators far less significant than the appropriate linguistic conditioning –Rest of the talk provides examples of this

16 16 Bayesian Statistical Modeling Concerned with prior and posterior probabilities Allows updates of reasoning Bayes’ Law: P(A,B) = P(A|B) P(B) = P(B|A) P(A) Eg: Source/Channel Model for Speech Recognition –Ws: sequence of words –As: sequence of acoustic observations –Compute ArgMax_Ws P(Ws|As): ArgMax_Ws P(Ws|As) = ArgMax_Ws P(As|Ws) P(Ws) / P(As) = ArgMax_Ws P(As|Ws) P(Ws) –P(As|Ws): acoustic model –P(Ws): language model
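
The source/channel decision rule on this slide can be sketched in a few lines of Python. The candidate word strings and all probabilities below are hypothetical values chosen for illustration, not figures from the talk; the point is only that P(As) is constant across candidates, so the argmax needs just the acoustic and language model scores.

```python
import math

# Hypothetical scores for one fixed acoustic observation As (not from the talk).
ACOUSTIC = {  # stands in for P(As | Ws)
    "flights from Boston today": 0.020,
    "flights from Austin today": 0.025,
    "lights for Boston to pay": 0.030,
}
LANGUAGE = {  # stands in for P(Ws)
    "flights from Boston today": 0.0100,
    "flights from Austin today": 0.0040,
    "lights for Boston to pay": 0.0001,
}

def decode(acoustic, language):
    """Return argmax_Ws P(As|Ws) P(Ws), computed in log space.

    P(As) does not depend on Ws, so it is dropped from the comparison.
    """
    return max(acoustic, key=lambda ws: math.log(acoustic[ws]) + math.log(language[ws]))

best = decode(ACOUSTIC, LANGUAGE)
acoustic_only = max(ACOUSTIC, key=ACOUSTIC.get)
print(best)           # choice when both models are combined
print(acoustic_only)  # choice when the language model is ignored
```

With these made-up numbers the acoustically best string is not the one selected once the language model prior is applied, which is the reason for combining the two models.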

17 17 Simple Bayesian Update Example Monty Hall’s Let’s Make a Deal Three curtains with prize behind one, no other info Contestant chooses one of three Monty then opens curtain of one of others that does not have the prize – if you choose curtain 2, then one of curtain 1 or 3 must not contain prize Monty then lets you either keep your first guess, or change to the remaining curtain he didn’t open. Should you switch, stay, or doesn’t it matter?

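18 18 Answer Yes! You should switch. Why? Consider possibilities: if the prize is behind the curtain you select (probability 1/3), staying wins; otherwise (probability 2/3), switching wins. Stay P(win) = 1/3 Switch P(win) = 2/3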
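
The 1/3 versus 2/3 answer can be checked by exact enumeration. The sketch below assumes the prize position and the first choice are uniform and independent, and that Monty picks uniformly when two curtains are available to open; the function name `monty_hall` is ours.

```python
from fractions import Fraction
from itertools import product

def monty_hall():
    """Return (P(win | stay), P(win | switch)) by enumerating all cases."""
    curtains = (1, 2, 3)
    p_stay = Fraction(0)
    p_switch = Fraction(0)
    for prize, choice in product(curtains, repeat=2):
        # Monty may only open a curtain that is neither chosen nor hides the prize.
        openable = [c for c in curtains if c != prize and c != choice]
        for opened in openable:
            weight = Fraction(1, 9) / len(openable)
            # The single curtain left over after the choice and Monty's opening.
            (remaining,) = [c for c in curtains if c != choice and c != opened]
            if choice == prize:
                p_stay += weight
            if remaining == prize:
                p_switch += weight
    return p_stay, p_switch

stay, switch = monty_hall()
print(stay, switch)
```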

19 19 Defaults via Bayesian Inference Bayesian Inference provides an explanation for “rationality” of default reasoning Reason by choosing an action to maximize expected payoff given some knowledge –ArgMax_Action Payoff(Action) * P(Action|Knowledge) Given additional information update to Knowledge’ –ArgMax_Action Payoff(Action) * P(Action|Knowledge’) –Chosen action may be different, as in Let’s Make a Deal Inferences are not logically sound, but are “rational” Bayesian framework integrates partiality and uncertainty of background knowledge
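
A minimal sketch of the argmax rule above, reading P(Action|Knowledge) as the probability that the action turns out right given what is known. The bird and penguin scenario and all numbers are hypothetical, added only to show a default conclusion being withdrawn when Knowledge is updated to Knowledge’.

```python
def best_action(payoff, p_success):
    """argmax over actions of payoff[action] * p_success[action]."""
    return max(payoff, key=lambda action: payoff[action] * p_success[action])

# Hypothetical payoffs and probabilities, for illustration only.
PAYOFF = {"assume it flies": 1.0, "assume it does not fly": 1.0}
GIVEN_BIRD = {"assume it flies": 0.95, "assume it does not fly": 0.05}
GIVEN_BIRD_AND_PENGUIN = {"assume it flies": 0.01, "assume it does not fly": 0.99}

default = best_action(PAYOFF, GIVEN_BIRD)
revised = best_action(PAYOFF, GIVEN_BIRD_AND_PENGUIN)
print(default)
print(revised)
```

Neither choice is a logically sound inference, but each maximizes expected payoff under the knowledge available at the time.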

20 20 Example: Allophonic Variation English Pronunciation (M. Riley & A. Ljolje, AT&T) Derived from TIMIT with phoneme/phone labels –orthographic: bottle –phonological: / b aa t ax l / (ARPAbet phonemes) –phonetic (TIMITbet phones): 0.75 [ b aa dx el ]; 0.13 [ b aa t el ]; 0.10 [ b aa dx ax l ]; 0.02 [ b aa t ax l ] Allophonic variation is non-deterministic
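
The "bottle" distribution on this slide can be held as a small table and queried. The probabilities and phone strings are the ones given above; the helper names and the coverage target are our own choices for illustration.

```python
# Pronunciation distribution for "bottle", as given on the slide.
BOTTLE = [
    (0.75, ("b", "aa", "dx", "el")),
    (0.13, ("b", "aa", "t", "el")),
    (0.10, ("b", "aa", "dx", "ax", "l")),
    (0.02, ("b", "aa", "t", "ax", "l")),
]

def most_likely(pronunciations):
    """Return the phone sequence with the highest probability."""
    return max(pronunciations, key=lambda pair: pair[0])[1]

def needed_for_coverage(pronunciations, target):
    """Count how many top pronunciations are needed to reach the target mass."""
    total = 0.0
    ranked = sorted(pronunciations, key=lambda pair: pair[0], reverse=True)
    for n, (p, _) in enumerate(ranked, start=1):
        total += p
        if total >= target:
            return n
    return None

print(most_likely(BOTTLE))
print(needed_for_coverage(BOTTLE, 0.95))
```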

21 21 Eg: Allophonic Variation (cont’d) Simple statistical model (simplified w/o insertion) Estimate probability of phones given phonemes: P(a1,…,aM|p1,…,pM) = P(a1|p1,…,pM) * P(a2|p1,…,pM,a1) * … * P(aM|p1,…,pM,a1,…,aM-1) Approximate phoneme context to +/- k phones Approximate phone history to 0 or 1 phones – 0: … P(aJ|pJ-K,…,pJ,…,pJ+K) … – 1: … P(aJ|pJ-K,…,pJ,…,pJ+K, aJ-1) … Uses word boundary marker and stress
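
The factorization with a +/- k phoneme window and a 0 or 1 phone history can be sketched as below. This assumes a one-to-one phoneme/phone alignment, as in the simplified model without insertion; `toy_model` is a hypothetical stand-in for the trained conditional distribution (the talk's model is estimated from TIMIT), and its flapping probabilities are invented.

```python
import math

BOUNDARY = "#"  # word-boundary marker
VOWELS = {"aa", "ax"}

def context(phonemes, j, k):
    """Return the window p[j-k..j+k], padded with the boundary marker."""
    padded = [BOUNDARY] * k + list(phonemes) + [BOUNDARY] * k
    return tuple(padded[j:j + 2 * k + 1])

def toy_model(phone, ctx, prev):
    """Hypothetical P(phone | ctx, prev) for k = 1; prev is accepted but unused."""
    left, center, right = ctx
    if center == "t" and left in VOWELS and right in VOWELS:
        dist = {"dx": 0.8, "t": 0.2}  # invented flapping probabilities
    else:
        dist = {center: 1.0}          # otherwise the phoneme surfaces unchanged
    return dist.get(phone, 0.0)

def sequence_logprob(phones, phonemes, model, k=1, history=1):
    """Sum over j of log P(a_j | p_{j-k..j+k}, a_{j-1}) under the given model."""
    if len(phones) != len(phonemes):
        raise ValueError("simplified model: phones and phonemes must align 1:1")
    total = 0.0
    for j, phone in enumerate(phones):
        prev = None
        if history == 1:
            prev = phones[j - 1] if j > 0 else BOUNDARY
        p = model(phone, context(phonemes, j, k), prev)
        total += math.log(p) if p > 0 else float("-inf")
    return total

phonemes = ["b", "aa", "t", "ax", "l"]
flapped = sequence_logprob(["b", "aa", "dx", "ax", "l"], phonemes, toy_model)
unflapped = sequence_logprob(["b", "aa", "t", "ax", "l"], phonemes, toy_model)
print(math.exp(flapped), math.exp(unflapped))
```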

22 22 Eg: Allophonic Variation (concl’d) Cluster phonological features using decision trees Sparse data smoothed by decision trees over standard features ( +/- stop, voicing, aspiration, etc.) Conditional entropy: 1.5 bits without context, 0.8 bits with context Most likely allophone correct 85.5%, in top 5, 99% Average 17 pronunciations/word to get 95% Robust: handles multiple pronunciations Scalable: to whole of English pronunciation Portable: easy to move to new dialects with training –K. Knight (ISI): similar techniques for Japanese pronunciation of English words!
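
The conditional entropy figures quoted above measure how uncertain the phone is once the conditioning information is known. A minimal sketch of that quantity, computed from (phoneme, phone) observations; the counts in the example are invented and are not TIMIT statistics.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """Return H(phone | phoneme) in bits from (phoneme, phone) observations."""
    pairs = list(pairs)
    joint = Counter(pairs)
    marginal = Counter(phoneme for phoneme, _ in pairs)
    n = len(pairs)
    # H(A|P) = -sum over (p, a) of P(p, a) * log2 P(a | p)
    return -sum(
        (count / n) * math.log2(count / marginal[phoneme])
        for (phoneme, _), count in joint.items()
    )

# Invented data: /t/ surfaces as [dx] or [t] equally often, /b/ always as [b].
observations = [("t", "dx")] * 50 + [("t", "t")] * 50 + [("b", "b")] * 100
print(conditional_entropy(observations))
```

Adding context to the conditioning side (for example, neighboring phonemes) can only lower or preserve this value, which is the direction of the 1.5 to 0.8 bit drop reported on the slide.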

23 23 Example: Co-articulation HMMs have been applied to speech since mid-70s Two major recent improvements, the first being simply more training data and cycles Second is: Context-dependent triphones Instead of one HMM per phoneme/phone, use one per context-dependent triphone –example: t-r+u ‘an r preceded by t and followed by u’ –crucially clustered by phonological features to overcome sparsity
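
The triphone notation on this slide (left context, hyphen, phone, plus sign, right context) can be generated mechanically from a phone string. How the first and last phones are labeled is an assumption here: this sketch simply omits the missing context, which is one common convention and not necessarily the one used in the systems the talk describes.

```python
def to_triphones(phones):
    """Label each phone with its left and right neighbors as left-phone+right."""
    labels = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i + 1 < len(phones) else ""
        labels.append(left + phone + right)
    return labels

print(to_triphones(["t", "r", "u"]))
```

Because the number of distinct triphones grows roughly with the cube of the phone inventory, most are rare or unseen in training, which is why the slide stresses clustering by phonological features.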

24 24 Exploratory Data Analysis (Trendier: data mining; Trendiest: information harvesting) Specious Argument: A statistical model won’t help explain linguistic processes. Counter 1: Abney’s anti-reductionist argument But even if you don’t believe that: Counter 2: In “other sciences” (pace linguistic tradition), statistics is used to discover regularities Allophone example: “had your” pronunciation –/ d / is 51% likely to realize as [ jh ], 37% as [ d ] –if / d / realizes as [ jh ], / y / deletes 84% of the time –if / d / realizes as [ d ], / y / deletes 10% of the time
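
The "had your" figures can be combined with the product rule to see how strongly the two events are coupled. The four input numbers are the ones on the slide; the remaining 12% of /d/ realizations are not specified there, so they are left out rather than guessed.

```python
# Figures from the slide.
P_D_REALIZATION = {"jh": 0.51, "d": 0.37}    # P(/d/ realized as phone)
P_Y_DELETES_GIVEN = {"jh": 0.84, "d": 0.10}  # P(/y/ deletes | realization of /d/)

# Product rule: P(realization, /y/ deletes) = P(realization) * P(deletes | realization)
joint = {
    phone: P_D_REALIZATION[phone] * P_Y_DELETES_GIVEN[phone]
    for phone in P_D_REALIZATION
}
ratio = P_Y_DELETES_GIVEN["jh"] / P_Y_DELETES_GIVEN["d"]

print(joint)
print(ratio)
```

A deletion rate several times higher after [ jh ] than after [ d ] is the kind of regularity that the statistics surface and that a phonological account (palatalization absorbing the glide) can then explain.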

25 25 Balancing Gricean Maxims Grice gives us conflicting maxims: –quantity (exactly as informative as required) –quality (try to make your contribution true) –manner (be perspicuous; eg. avoid ambiguity, be brief) Manner pulls in opposite directions –quality without ambiguity lengthens statements –quantity and (part of) manner require brevity Balance by estimating a multidimensional “goodness” metric for generation

26 26 Gricean Balance (cont’d) Consider problem for aggregation in generation –Every student ran slowly or every student walked quickly. Aggregates to: –Every student ran slowly or walked quickly. This reduces sentence length, shortens clause length, and increases ambiguity. These tradeoffs need to be balanced

27 27 Collins’ Head/Dependency Parser Michael Collins 1998 UPenn PhD thesis Parses WSJ with ~90% constituent precision/recall Generative model of tree probabilities Clever Linguistic Decomposition and Training –P(RootCat, HeadTag, HeadWord) –P(DaughterCat|MotherCat, HeadTag, HeadWord) –P(SubCat|MotherCat, DtrCat, HeadTag, HeadWord) –P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)

28 28 Eg: Collins’ Parser (cont’d) Distance encodes heaviness Adjunct vs. Complement modifiers distinguished Head Words and Tags model lexical variation and word-word attachment preferences Also conditions punctuation, coordination, UDCs 12,000 word vocabulary plus unknown word attachment model (by Collins) and tag model (by A. Ratnaparkhi, another 1998 UPenn thesis) Smoothed by backing off words to categories Trivial statistical estimators; power is conditioning
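
Backing off from words to categories can be illustrated with a relative-frequency estimator that mixes a word-specific estimate with a tag-only one. This is a generic sketch with a fixed, hypothetical interpolation weight `lam` and invented training events; it is not a reproduction of the smoothing scheme in Collins' thesis.

```python
from collections import Counter

def make_estimator(events, lam=0.7):
    """Build P(outcome | head_tag, head_word) that backs off to P(outcome | head_tag).

    events: iterable of (outcome, head_tag, head_word) triples.
    lam: hypothetical fixed weight on the word-specific estimate.
    """
    fine, fine_ctx = Counter(), Counter()
    coarse, coarse_ctx = Counter(), Counter()
    for outcome, tag, word in events:
        fine[(outcome, tag, word)] += 1
        fine_ctx[(tag, word)] += 1
        coarse[(outcome, tag)] += 1
        coarse_ctx[tag] += 1

    def prob(outcome, tag, word):
        p_coarse = coarse[(outcome, tag)] / coarse_ctx[tag] if coarse_ctx[tag] else 0.0
        if fine_ctx[(tag, word)] == 0:
            return p_coarse  # unseen word: rely on the category alone
        p_fine = fine[(outcome, tag, word)] / fine_ctx[(tag, word)]
        return lam * p_fine + (1 - lam) * p_coarse

    return prob

# Invented events: which modifier category attaches under a verb head.
events = [
    ("NP", "VBD", "rose"),
    ("NP", "VBD", "rose"),
    ("PP", "VBD", "rose"),
    ("PP", "VBD", "fell"),
]
prob = make_estimator(events)
print(prob("NP", "VBD", "rose"))     # seen word: mixes word and tag estimates
print(prob("NP", "VBD", "climbed"))  # unseen word: tag estimate only
```

The estimator itself is trivial, in line with the slide's point that the power comes from what is conditioned on.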

29 29 Computational Complexity Wide coverage linguistic grammars generate millions of readings But Collins’ parser runs faster than real time on a notebook on unseen sentences of length up to 100 How? Pruning. Collins found that tighter statistical estimates of tree likelihoods, with more features and more complex grammars, ran faster because a tighter beam could be used –(E. Charniak & S. Caraballo at Brown have really pushed the envelope here)
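
Why sharper estimates allow more pruning can be seen in a toy beam. The sketch keeps every hypothesis whose log score is within a fixed margin of the best; the parse names, the two score distributions, and the beam width are all hypothetical.

```python
import math

def prune(hypotheses, beam):
    """Keep hypotheses whose log score is within `beam` of the best log score."""
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam}

# Invented scores: a weak model spreads mass evenly, a strong one concentrates it.
FLAT = {name: math.log(p) for name, p in
        [("parse A", 0.30), ("parse B", 0.28), ("parse C", 0.22), ("parse D", 0.20)]}
SHARP = {name: math.log(p) for name, p in
         [("parse A", 0.90), ("parse B", 0.06), ("parse C", 0.03), ("parse D", 0.01)]}
BEAM = math.log(4)  # keep anything within a factor of 4 of the best

print(len(prune(FLAT, BEAM)))   # nothing can be discarded safely
print(len(prune(SHARP, BEAM)))  # most competitors fall outside the beam
```

With the same beam, the sharper model leaves far fewer partial analyses to extend, which is the effect the slide attributes to better conditioning.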

30 30 Complexity (cont’d) Collins’ parser is not complete in the usual sense But neither are humans (eg. garden paths) Can trade speed for accuracy in statistical parsers Syntax is not processed autonomously –Humans can’t parse without context, semantics, etc. –Even phone or phoneme detection is very challenging, especially in a noisy environment –Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online –Question is how to combine it with other factors

31 31 N-best and Word Graphs Speech recognizers can return n-best histories –flights from Boston today –flights from Austin today –flights for Boston to pay –lights for Boston to pay Can also return a packed word graph of histories; sum of path log probs equals acoustics / word-string joint log prob [Figure: word graph over flights | lights, from | for, Boston | Austin, today | to pay]
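
The relation between a packed word graph and an n-best list can be sketched by enumerating paths and summing arc log probabilities. The graph topology below loosely follows the words on the slide, but the node numbering and arc probabilities are hypothetical, and the graph admits more paths than the four strings listed.

```python
import math

# node -> list of (word, probability, next node); probabilities are invented.
GRAPH = {
    0: [("flights", 0.7, 1), ("lights", 0.3, 1)],
    1: [("from", 0.7, 2), ("for", 0.3, 2)],
    2: [("Boston", 0.6, 3), ("Austin", 0.4, 3)],
    3: [("today", 0.7, 5), ("to", 0.3, 4)],
    4: [("pay", 1.0, 5)],
    5: [],
}
FINAL = 5

def paths(node=0, words=(), logp=0.0):
    """Yield (log probability, word string) for every path to the final node."""
    if node == FINAL:
        yield logp, " ".join(words)
    for word, p, nxt in GRAPH[node]:
        yield from paths(nxt, words + (word,), logp + math.log(p))

def n_best(n):
    """Return the n highest-scoring paths, best first."""
    return sorted(paths(), reverse=True)[:n]

for logp, sentence in n_best(3):
    print(round(math.exp(logp), 4), sentence)
```

Full enumeration is only workable for a toy graph; the packed representation matters precisely because a real graph encodes far more paths than could be listed.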

32 32 Probabilistic Graph Processing The architecture we’re exploring in the context of spoken dialogue systems involves: –Speech recognizers that produce probabilistic word graph output –A tagger that transforms a word graph into a word/tag graph with scores given by joint probabilities –A parser that transforms a word/tag graph into a graph-based chart (as in CKY or chart parsing) Allows each module to rescore output of previous module’s decision Apply this architecture to speech act detection, dialogue act selection, and in generation

33 33 Prices rose sharply after hours 15-best as a word/tag graph + minimization [Figure: word/tag graph with arcs labeled prices:NNS, prices:NN, rose:VBD, rose:VBP, rose:NN, rose:NNP, sharply:RB, after:IN, after:RB, hours:NNS]

34 34 Challenge: Beat n-grams Backed off trigram models estimated from 300M words of WSJ provide best language models We know there is more to language than two words of history Challenge is to find out how to model it.
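
For readers unfamiliar with the baseline being challenged, here is a small trigram model. Note the substitution: it uses linear interpolation of trigram, bigram, and unigram relative frequencies with fixed, hypothetical weights, which is simpler than the discounted backoff the slide refers to. The training text is a toy string, not WSJ.

```python
from collections import Counter

def train(tokens):
    """Collect n-gram counts and the context counts needed to normalize them."""
    return {
        "uni": Counter(tokens),
        "bi": Counter(zip(tokens, tokens[1:])),
        "tri": Counter(zip(tokens, tokens[1:], tokens[2:])),
        "bi_ctx": Counter(tokens[:-1]),                       # words that have a successor
        "tri_ctx": Counter(zip(tokens[:-2], tokens[1:-1])),   # pairs that have a successor
        "n": len(tokens),
    }

def prob(w, u, v, c, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated P(w | u v); lambdas are hypothetical weights summing to 1."""
    l3, l2, l1 = lambdas
    p1 = c["uni"][w] / c["n"]
    p2 = c["bi"][(v, w)] / c["bi_ctx"][v] if c["bi_ctx"][v] else 0.0
    p3 = c["tri"][(u, v, w)] / c["tri_ctx"][(u, v)] if c["tri_ctx"][(u, v)] else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1

tokens = "prices rose sharply after hours prices rose again".split()
counts = train(tokens)
print(prob("sharply", "prices", "rose", counts))
```

The model sees only two words of history, which is exactly the limitation the slide asks how to move beyond.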

35 35 Conclusions Need ranking of hypotheses for applications Beam can reduce processing time to linear –need good statistics to do this More linguistic features are better for stat models –can induce the relevant ones and weights from data –linguistic rules emerge from these generalizations Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty –ideal is totally online (model is compatible with this) –approximation allows simpler modules to do first pruning

36 36 Plugs Run, don’t walk, to read: Steve Abney. 1996. Statistical methods and linguistics. In J. L. Klavans and P. Resnik, eds., The Balancing Act. MIT Press. Mark Seidenberg and Maryellen MacDonald. 1999. A probabilistic constraints approach to language acquisition and processing. Cognitive Science. Dan Jurafsky and James H. Martin. 2000. Speech and Language Processing. Prentice-Hall. Chris Manning and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
