
1 Machine Learning for (Psycho-)Linguistics Walter Daelemans daelem@uia.ua.ac.be http://cnts.uia.ac.be CNTS, University of Antwerp ILK, Tilburg University QITL-02

2 Outline
Machine Learning of Language
– Induction of rules and classes
– Learning by analogy
Case Studies
– Discovery of phonological categories and morphological rules
– A single-route model of morphological processing
Issues
– Probabilities versus symbolic structure induction
– Nativism versus empiricism
– Exemplar analogy versus rules

3 [Diagram: a learning component searches a space of representations (R_i, R_j, R_k, R_l) on the basis of experience and a bias; the resulting performance component maps input to output.]

4 Problems with Probabilities
Explanation
– also applies to neural networks
Event relevance
– especially in unsupervised learning (clustering)
Incorporation of linguistic knowledge
Smoothing of zero-frequency events

5 (Symbolic) Machine Learning
Rule induction (understandable induced theories)
Inductive Logic Programming (incorporating linguistic knowledge)
Memory-based learning (similarity-based smoothing of sparse data, feature weighting)
…

6 Common Fallacies
Rules = nativism (and connections = empiricism)
Generalization = abstraction (and memory = table lookup)

7 Rule-Based ≠ Innate
Rules can be induced from primary linguistic data as well
Applications in linguistics:
– evaluation and comparison of linguistic hypotheses
– discovery of linguistic generalizations and categories

8 Allomorphy in the Dutch Diminutive
“one of the more spectacular phenomena of modern Dutch morphophonemics” (Trommelen 1983)
Base form of noun + [tje] (5 variants)
Linguistic theory (from Te Winkel 1862): rime of the last syllable, stress, morphological structure, …
Trommelen (1983): a local phenomenon; stress and morphological structure play no role
CELEX data (3,900 nouns); example instance: - b i = - z @ = + m A nt → -je (one stress/onset/nucleus/coda group per syllable, with = marking an empty slot; encoding sketched below)
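Read as data, the example instance is a fixed-width feature vector. A minimal Python sketch of this encoding; the helper, its padding convention, and the field layout are illustrative assumptions, not the original implementation:

```python
def encode(syllables, n=3):
    """Encode a noun as stress/onset/nucleus/coda fields of its last n syllables.
    syllables: list of (stress, onset, nucleus, coda) tuples; '=' marks empty slots.
    Shorter words are padded on the left with empty syllables (an assumption)."""
    pad = [("=", "=", "=", "=")] * max(0, n - len(syllables))
    last = pad + syllables[-n:]
    return [field for syl in last for field in syl]

# The slide's example instance: three syllables, final one stressed,
# labelled with the diminutive allomorph -je.
instance = encode([("-", "b", "i", "="), ("-", "z", "@", "="), ("+", "m", "A", "nt")])
label = "je"
print(instance)  # ['-', 'b', 'i', '=', '-', 'z', '@', '=', '+', 'm', 'A', 'nt']
```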

9 Allomorphs

Suffix   Example        N      % types   % tokens
-tje     kikker-tje     1896   48.0      50.9
-etje    roman-etje      395   10.0      10.9
-pje     lichaam-pje     104    2.6       4.0
-kje     koning-kje       77    1.9       3.8
-je      wereld-je      1478   37.4      30.4

10 Decision Tree Learning
Given a data set, construct a decision tree that reflects the structure of the domain
A decision tree is a tree where
– non-leaf nodes represent features (tests)
– branches leading out of a test represent possible values for the feature
– leaf nodes represent outcomes (classes)
A decision tree can be translated into a set of IF-THEN rules (with further optimization)
Value grouping

11 Decision Tree Construction
Given a set of examples T:
– If T contains one or more cases all belonging to the same class C, then the decision tree for T is a leaf node with category C.
– If T contains different classes, then choose a feature and partition T into subsets that have the same value for the chosen feature. The decision tree consists of a node containing the feature name and a branch for each value leading to a subset. Apply the procedure recursively to the subsets created this way.
A minimal sketch of this procedure is given below.
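A minimal Python sketch of the recursive procedure just described. The slide leaves the feature-choice heuristic open, so information gain is assumed here:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_feature(examples, labels, features):
    """Pick the feature whose value-based partition of T minimizes the
    weighted entropy of the subsets (i.e., maximizes information gain)."""
    def split_entropy(f):
        part = {}
        for x, y in zip(examples, labels):
            part.setdefault(x[f], []).append(y)
        return sum(len(ys) / len(labels) * entropy(ys) for ys in part.values())
    return min(features, key=split_entropy)

def build_tree(examples, labels, features):
    if len(set(labels)) == 1:            # all cases in one class C -> leaf C
        return labels[0]
    if not features:                     # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(examples, labels, features)
    node = {"feature": f, "branches": {}}
    for v in {x[f] for x in examples}:   # one branch per value; recurse on subset
        idx = [i for i, x in enumerate(examples) if x[f] == v]
        node["branches"][v] = build_tree([examples[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [g for g in features if g != f])
    return node
```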

12 Induced Rule Set
Default class is -tje
1. IF coda last is /lm/ or /rm/ THEN -pje
2. IF nucleus last is [+bimoraic] AND coda last is /m/ THEN -pje
3. IF coda last is /N/ THEN
       IF nucleus penultimate is empty or schwa THEN -etje
       ELSE -kje
4. IF nucleus last is [+short] AND coda last is [+nas] or [+liq] THEN -etje
5. IF coda last is [+obstruent] THEN -je

13 Results
Problem is almost perfectly learnable (98.4%)
More than the last syllable is needed for a full solution
Only the rime of the last syllable (not stress or onset) is relevant
Induced categories:
– nasals, liquids, obstruents, short vowels, bimoraic vowels (consisting of vowels, diphthongs, schwa)
– task-dependent categories? Category formation is dependent on the task to be learned; it is not absolute and not language-independent

14 Conclusions: Rule Induction in Linguistics
Falsify existing linguistic theories
Evaluate the role of linguistic information sources
(Re)discover interesting linguistic rules (supervised learning)
(Re)discover interesting linguistic categories (unsupervised learning)
An empiricist alternative to (mostly nativist) rule-based systems

15 There is one small problem …
The current methodology for comparative machine learning experiments is not reliable (especially with small data sets):
– different runs of the algorithm produce different rule sets
– the algorithm can be tweaked to get high performance with any combination of information sources
– the algorithm is highly sensitive to training data, feature selection, algorithm parameter settings, …
Only to be used as a heuristic
– as with your own rule induction module

16 Word Sense Disambiguation (“do”)
[Results table, partially recoverable:
Local Context, default + keywords: 47.9 / 49.0
Local Context, optimized parameters: 59.5 / 60.8
Optimized parameters: 61.0 / 60.8
Similar pictures for: experience, material, say, then, …]

17 Generalisation ≠ Abstraction

                     + abstraction                    − abstraction
+ generalisation     Rule Induction, Connectionism,   Memory-Based Learning, …
                     Inductive Logic Programming,
                     Statistics
− generalisation     Handcrafting                     Table Lookup
                     (fill in your most hated
                     linguist here)

18 “This ‘rule of nearest neighbor’ has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient.” (Fix and Hodges, 1952, p. 43)
MBL: use memory traces of experiences as a basis for analogical reasoning, rather than using rules or other abstractions extracted from experience and replacing the experiences.

19 [Figure: Rule Induction — instance space with axes “nucleus last syllable” and “coda last syllable”; induced decision boundaries separate the -etje region from the -kje region.]

20 [Figure: MBL — the same instance space; a new item “?” is classified by its nearest -etje/-kje neighbours rather than by a boundary.]

21 Memory-Based Learning
Basis: the k-nearest-neighbour algorithm:
– store all examples in memory
– to classify a new instance X, look up the k examples in memory with the smallest distance D(X,Y) to X
– let each nearest neighbour vote with its class
– classify instance X with the class that has the most votes in the nearest-neighbour set
Choices:
– similarity metric
– number of nearest neighbours (k)
– voting weights
A minimal sketch of this procedure follows.
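A minimal, self-contained Python sketch of these four steps, using the overlap distance over symbolic features and equal-weight voting; the class and method names are illustrative:

```python
from collections import Counter

class MBL:
    """Minimal memory-based learner: store everything, classify by k-NN vote."""
    def __init__(self, k=1, weights=None):
        self.k, self.weights, self.memory = k, weights, []

    def train(self, examples, labels):
        self.memory = list(zip(examples, labels))   # learning = storage

    def distance(self, x, y):
        # Weighted overlap distance: add w_i whenever feature i mismatches.
        w = self.weights or [1.0] * len(x)
        return sum(wi for wi, xi, yi in zip(w, x, y) if xi != yi)

    def classify(self, x):
        nearest = sorted(self.memory, key=lambda ex: self.distance(x, ex[0]))[:self.k]
        votes = Counter(label for _, label in nearest)   # equal-weight voting
        return votes.most_common(1)[0][0]
```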

22 Metrics
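The metric definitions themselves did not survive in this transcript; as a hedged reconstruction, the standard memory-based choices (those of TiMBL, for instance) are the weighted overlap distance with information-gain feature weights:

```latex
% Weighted overlap distance between instances X and Y over n symbolic features:
D(X,Y) \;=\; \sum_{i=1}^{n} w_i \,\delta(x_i, y_i),
\qquad
\delta(x_i, y_i) \;=\;
\begin{cases}
0 & \text{if } x_i = y_i\\
1 & \text{otherwise}
\end{cases}

% Information-gain feature weight, with V_i the value set of feature i
% and H the entropy of the class distribution C:
w_i \;=\; H(C) \;-\; \sum_{v \in V_i} P(v)\, H(C \mid v)
```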

23 Metrics (2)
Voting options:
– equal weight for each nearest neighbour
– distance-weighted voting:
   inverse distance 1/D(X,Y) (Wettschereck, 1994)
   RBF-style gaussian voting function (Shepard, 1987)
   linear voting function (Dudani, 1976)
(NB: the weighted nearest-neighbour class distribution can be used as a conditional probability; see the formulas below)
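Written out — the exact functional forms are assumptions, since the slide gives only names and citations — with d_j = D(X, Y_j) for the j-th nearest neighbour and neighbours ordered so that d_1 ≤ … ≤ d_k:

```latex
% Distance-weighted voting functions:
w_j = \frac{1}{d_j}                      % inverse distance (Wettschereck, 1994)
\qquad
w_j = e^{-\alpha d_j^{2}}                % gaussian (Shepard, 1987)
\qquad
w_j = \frac{d_k - d_j}{d_k - d_1}        % linear (Dudani, 1976)

% Weighted NN distribution read as a conditional class probability:
P(c \mid X) = \frac{\sum_{j=1}^{k} w_j \,[\mathrm{class}(Y_j) = c]}
                   {\sum_{j=1}^{k} w_j}
```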

24 MBL: Acquisition
An inflectional process is represented by a set of exemplars in memory:
– exemplars act as models
– learning is incremental storage of exemplars
– compression and metrics
An exemplar consists of a set of (mostly symbolic) features

25 MBL: Processing
New instances of a performance process are solved through
– memory lookup
– analogical (similarity-based) reasoning
Similarity metric:
– language(-faculty)-independent
– adaptive (feature and exemplar weighting)

26 The properties of language processing tasks …
Language processing tasks are mappings between linguistic representation levels that are
– context-sensitive (but mostly local!)
– complex (sub-/ir-/regularity), with pockets of exceptions
Similar representations at one linguistic level correspond to similar representations at the other level
Several information sources interact in (often) unpredictable ways at the same level
Data is sparse

27 … fit the bias of MBL
Inference is based on similarity-based / analogical reasoning
Adaptive data fusion / relevance assignment is available through feature weighting
It is a non-parametric approach
Similarity-based smoothing is implicit
Regularities and subregularities / exceptions can be modeled uniformly

28 German and Dutch plurals

29 Data & Representation
Symbolic features:
– segmental information (syllable structure)
– stress
– gender
German plural (~25,000 nouns from CELEX), e.g. Vorlesung (lecture): l e - z U N, F → -en
Classes: -e, -(e)n, -s, -er, - (zero), U- (umlaut + zero), Uer (umlaut + -er), Ue (umlaut + -e)
Dutch plural (~62,000 nouns from CELEX), e.g. ontruiming (evacuation): 0 - O nt 1 r L - 0 m I N → -en
Classes: -(e)n, -s (and -eren, -i, -a, …)

30 Cognitive Architectures of Inflectional Morphology
Dual route (Pinker, Clahsen, Marcus, …):
– rules for regular cases: (over)generalization, default behaviour
– associative memory for exceptions: irregularization / family effects
Single route (R&M, MacWhinney, Plunkett, Elman, …):
– frequency-based regularity
[Diagram: dual-route model — input features are looked up in an associative memory (pattern associator); on memory failure, a default rule produces the suffix class.]

31 German Plural
Notoriously complex, but routinely acquired (at age 5)
Evidence for a dual route?
– the -s suffix is the default/regular form (novel words, surnames, acronyms, …)
– the -s suffix is infrequent (the least frequent of the five most important suffixes)

32 [figure]

33 The default status of -s

Similar item missing        Fnöhk-s
Surname, product name       Mann-s
Borrowings                  Kiosk-s
Acronyms                    BMW-s
Lexicalized phrases         Vergissmeinnicht-s
Onomatopoeia, truncated roots, derived nouns, …

34 [figure]

35 Discussion
Three “classes” of plurals: ((-en, -)(-e, -er))(-s)
– the first four suffixes seem “regular” and can be accurately learned using information from phonology and gender
– -s is learned reasonably well, but information is lacking
Hypothesis: more “features” (syntactic, semantic, meta-linguistic, …) are needed to enrich the “lexical similarity space”
No difference in accuracy and speed of learning with and without Umlaut
Overall generalization accuracy is very high: 95%
Schema-based learning (Köpcke), e.g. *,*,*,*,i,r,Me

36 [figure]

37 [figure]

38 Acquisition Data: Summary of Previous Studies
Existing nouns (Park 78; Veit 86; Mills 86; Schamer-Wolles 88; Clahsen et al. 93; Sedlak et al. 98):
– children mainly overapply -e or -(e)n
– -s plurals are learned late
Novel words (Mugdan 77; MacWhinney 78; Phillis & Bouma 80; Schöler & Kany 89):
– children inflect novel words with -e or -(e)n
– more “irregular” plural forms are produced than “defaults”

39 MBL Simulation
The model overapplies mainly -en and -e
-s is learned late and imperfectly
Mainly, but not completely, parallel to input frequency (more -s overgeneralization than -er overgeneralization)

40 Bartke, Marcus, Clahsen (1995)
37 children, aged 3;6 to 6;6
Pictures of imaginary things, presented as neologisms:
– names or roots
– rhyming with existing words or not
– choice between -en and -s
Results:
– children are aware that unusual-sounding words require the default
– children are aware that names require the default

41 MBL Simulation
Sort CELEX data according to rhyme (with existing words)
Compare overgeneralization:
– to -en versus to -s
– as a percentage of the total number of errors
Results:
– when new words do not rhyme with existing words, more errors are made
– overgeneralization to -en drops below the level of overgeneralization to -s

42 Dutch Plural
Suffixes -en and -s are both defaults and are in complementary distribution
Selection of -en or -s is governed by:
– the phonological structure of the base noun (stressed vs. unstressed last syllable)
– morphological structure (suffix of the base noun)
– loan-word status
– the semantic feature person vs. thing
– both are possible after /ə/ (Baayen et al. 2001)

43 Feature Relevance [figure]

44 Accuracy on CELEX
Methodology: “leave-one-out”
Results: MBL 94.9 % accuracy

Class    Prec   Rec    F
-(e)n    95.8   97.2   96.4
-s       93.8   91.4   92.6
-i       82.0   77.2   79.5

– without stress: 94.9 % accuracy
– last syllable, with stress: 92.6 % accuracy
– last syllable, without stress: 92.4 % accuracy
– rhyme of last syllable: 89.6 % accuracy
A sketch of the leave-one-out protocol follows.
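A sketch of the leave-one-out protocol, assuming the MBL learner interface sketched at slide 21; the factory argument is an illustrative convention:

```python
def leave_one_out_accuracy(examples, labels, make_learner):
    """Train on all items but one, test on the held-out item, average."""
    correct = 0
    for i in range(len(examples)):
        learner = make_learner()
        learner.train(examples[:i] + examples[i+1:], labels[:i] + labels[i+1:])
        correct += learner.classify(examples[i]) == labels[i]
    return correct / len(examples)

# e.g. leave_one_out_accuracy(X, y, lambda: MBL(k=1))
```

For a memory-based learner, the same result can be had more cheaply by training once and excluding each test item from its own neighbour set, since "training" is just storage.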

45 Accuracy on Pseudo-Words
Methodology:
– train: CELEX (all) and CELEX (1,000 most frequent types)
– test: 8 × 10 pseudo-words (Baayen et al., 2001): dreip, workel, bastus, bestroeting, kloertje, stape, stree, kadisme
Results (accuracy = number of decisions equal to the subject majority for each item):
– subjects: 87.5 %
– MBL (all): 83.8 %
– MBL (top 1000): 90.0 %

46 Example: muidus → muidi (nearest neighbours: modus, modi)
CELEX bias: low-frequency and loan-word nearest neighbours

47 Conclusions: Memory-Based Single Route
MBLP picks up the main “schemata” of Dutch and German plural formation, and their exceptions, without recourse to explicit rules or a dual-route architecture
MBLP trained on (part of) CELEX matches subject behavior on pseudo-words and acquisition data
Segmental information suffices to reliably predict the plural in Dutch and most plurals in German; additional information is needed for German -s
Heterogeneity and density in the lexical exemplar space serve as a source of behavior predictions

48 Overall Conclusions
Advantages of symbolic machine learning methods over ‘pure statistics’:
– as a methodology for inducing interpretable linguistic generalizations and categories
– as a way of introducing an operationalisation of analogy-based methods into (psycho)linguistics

