Supervised and Unsupervised learning for Natural language processing Manaal Faruqui Language Technologies Institute SCS, CMU.


1 Supervised and Unsupervised learning for Natural language processing Manaal Faruqui Language Technologies Institute SCS, CMU

2 Natural Language Processing: Linguistics + Computer Science

3 Natural Language Processing But why? Humans cannot handle large amounts of data. Computers give much, much faster access to information.

4 Natural Language Processing How can this be done ? Can you teach a computer ?

5 Natural Language Processing = Mathematics Using Maths to learn language ??? Are you kidding me !

6 Machine Learning Teaching computers to make decisions like humans. Examples: computer vision, machine translation, clustering

7 Machine Learning
– Supervised: Learning by examples
– Unsupervised: Learning by patterns
– Semi-supervised: Learning by patterns + examples

8 Formal & Informal address
Most languages distinguish formal (V) and informal (T) address in direct speech (Brown and Gilman, 1960)
– Formal address: Neutrality, distance
– Informal address: Friends, subordinates
Variety of realization in different languages
– French: Pronoun usage (Vous/Tu)
– German: Pronoun usage (Sie/Du)
– Hindi: Pronoun usage (Aap/Tum)
– Japanese: Verbal inflections
– English: ???

9 Main goals of this work Goal 1: Determine whether English distinguishes between V & T consistently If yes, what are the indicators ? Goal 2: Develop a computational model that labels English sentences as T or V Ideally without spending effort on annotation

10 Methodology Use a parallel corpus to analyze aligned sentences with overt (De) T/V choice and covert (En) T/V choice For Goal 1: Compare De & En sentences For Goal 2 : Project De labels onto En sentences
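The projection step for Goal 2 can be sketched as follows. This is a minimal illustration, assuming simple pronoun-based rules for labeling the German side; the pronoun lists and the function names `label_german` and `project_labels` are hypothetical and are not the rules used in the original work.

```python
import re

# Hypothetical orthographic rules (an assumption, not the authors' exact rules):
# informal pronouns mark T; a capitalized formal pronoun that is not
# sentence-initial marks V.
T_PRONOUNS = {"du", "dich", "dir", "dein", "deine"}
V_PRONOUNS = {"Sie", "Ihnen", "Ihr", "Ihre"}


def label_german(sentence):
    """Return "T", "V", or None for a German sentence."""
    tokens = re.findall(r"\w+", sentence)
    has_t = any(tok.lower() in T_PRONOUNS for tok in tokens)
    # Skip the first token: a sentence-initial "Sie" may just mean "she"/"they".
    has_v = any(tok in V_PRONOUNS for tok in tokens[1:])
    if has_t and not has_v:
        return "T"
    if has_v and not has_t:
        return "V"
    return None


def project_labels(aligned_pairs):
    """Copy each German label onto its aligned English sentence."""
    projected = []
    for german, english in aligned_pairs:
        label = label_german(german)
        if label is not None:
            projected.append((english, label))
    return projected


pairs = [
    ("Kommst du morgen?", "Are you coming tomorrow?"),
    ("Können Sie mir helfen?", "Can you help me?"),
    ("Der Regen hörte auf.", "The rain stopped."),
]
print(project_labels(pairs))
```

The English sentences receive labels without any manual annotation; sentences whose German counterpart carries no T/V signal are simply dropped.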

11 Digression: Creation of a parallel corpus
Current parallel corpora not suitable
– Europarl: Overwhelmingly formal (99%)
– Newswire: No dialogue
Creation of a new corpus: De-En literary texts
– th century novels (Project Gutenberg)
– Sentence-aligned: Gargantua (Braune & Fraser 2010)
– POS-tagged (Schmid 1994)
German sentences can be labeled as T, V or None using orthographic rules
Corpus:

12 Goal 1: Compare De and En address Give English monolingual text to human annotators Ask for T/V judgment Their annotation provides the following information How well do annotators agree on English text? Does English monolingual text provide enough information to identify T/V? (1a) How well do annotators agree with copied labels? Is there a direct correspondence ? (1b) Only if this is the case is the copying of labels appropriate

13 Experiment 1: Human Annotation 200 randomly drawn English sentences Two annotators (“A1”, “A2”) Two conditions: – No context: just one sentence – In context: three sentences pre- and post-context each

14 Results: Reliability
Context improves reliability
– Many sentences cannot be tagged with T/V in isolation: “And she is a sort of relation of your lordship’s,” said Dawson. “And perhaps sometime you may see her.”
Reliability in context is reasonable: English does provide strong clues on T/V
Agreement    No context     In context
A1 vs. A2    .75 (κ=.49)    .79 (κ=.58)
Goal 1a ✓
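The reliability figures pair raw agreement with a chance-corrected coefficient, which is conventionally Cohen's kappa for two annotators. A minimal sketch of that computation follows; the two label sequences are invented toy data, not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both annotators labeled at random
    # according to their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy annotations (invented for illustration).
a1 = ["T", "T", "V", "V", "T", "V", "V", "T"]
a2 = ["T", "V", "V", "V", "T", "V", "T", "T"]
print(cohens_kappa(a1, a2))  # raw agreement is 6/8 = .75, kappa is 0.5
```

With two balanced labels, half of the raw agreement is expected by chance, which is why an agreement of .75 corresponds to a kappa of only about .5.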

15 Results: Correspondence
Agreement                    No context     In context
(A1 ∩ A2) vs. Projection     .67 (κ=.34)    .79 (κ=.58)
Agreement with German projected labels again reasonable, but not perfect
Error analysis showed strong influence of social norms
– Example: Lovers in 19th-century novels use V (!): [...] she covered her face with the other to conceal her tears. “Corinne!”, said Oswald, “Dear Corinne! My absence has then rendered you unhappy!”
Goal 1b ✓

16 Experiment 2: Prediction of T/V
Copy German T/V labels onto English: No annotation
Learn L2-regularized logit classifier on train set; optimize on dev set; evaluate on test set
Feature candidates:
– Lexical features (bag-of-words, χ² feature selection)
– Distributional semantic word classes: 200 word classes clustered with the algorithm by Clark (2003)
– Politeness theory (Brown & Levinson 2003): Polite speech has specific features, which are inherited by V
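The χ² feature selection mentioned for the lexical features can be illustrated with the standard 2×2 contingency statistic. This is a sketch on invented toy data; the function name `chi_square` and the example words are assumptions, and the original experiments may have computed the statistic differently.

```python
def chi_square(docs, labels, word, positive="V"):
    """Chi-square score of `word` against the label `positive`.

    docs is a list of token sets, labels a parallel list of "T"/"V".
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = word in tokens
        if present and label == positive:
            a += 1  # word present, class V
        elif present:
            b += 1  # word present, class T
        elif label == positive:
            c += 1  # word absent, class V
        else:
            d += 1  # word absent, class T
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Toy data (invented for illustration).
docs = [{"sir", "please"}, {"sir", "thank"}, {"hey", "please"}, {"hey", "pal"}]
labels = ["V", "V", "T", "T"]
print(chi_square(docs, labels, "sir"))     # 4.0: perfectly associated with V
print(chi_square(docs, labels, "please"))  # 0.0: independent of the label
```

Words are ranked by this score and only the top-scoring ones are kept as bag-of-words features.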

17 Supervised Learning Logistic regression classifier
– Linear combination of features
– Every feature is assigned a weight according to its importance: higher weight = more importance
– L2 regularization to avoid overfitting
– Used “Weka” as the open-source toolkit
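To make the classifier concrete, here is a small L2-regularized logistic regression over bag-of-words features, trained by plain gradient descent. It is a from-scratch sketch for illustration only and is not Weka's implementation; the hyperparameters and the toy training sentences are invented.

```python
import math


def train_logreg(examples, labels, l2=0.1, lr=0.5, epochs=200):
    """examples: lists of feature names; labels: 1 for V, 0 for T."""
    weights = {}
    bias = 0.0
    n = len(examples)
    for _ in range(epochs):
        grad = {}
        grad_bias = 0.0
        for feats, y in zip(examples, labels):
            z = bias + sum(weights.get(f, 0.0) for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            grad_bias += err
            for f in feats:
                grad[f] = grad.get(f, 0.0) + err
        for f in set(grad) | set(weights):
            # The l2 term shrinks every weight towards zero (regularization).
            g = grad.get(f, 0.0) / n + l2 * weights.get(f, 0.0)
            weights[f] = weights.get(f, 0.0) - lr * g
        bias -= lr * grad_bias / n
    return weights, bias


def predict_proba(weights, bias, feats):
    """Probability of class V for a bag of features."""
    z = bias + sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


# Toy training data (invented): 1 = V (formal), 0 = T (informal).
examples = [["sir", "please"], ["madam", "please"], ["hey", "pal"], ["dear", "pal"]]
labels = [1, 1, 0, 0]
weights, bias = train_logreg(examples, labels)
print(predict_proba(weights, bias, ["sir"]) > 0.5)
print(predict_proba(weights, bias, ["pal"]) < 0.5)
```

Features seen only with V end up with positive weights and features seen only with T with negative weights, which is the "higher weight = more importance" reading from the slide.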

18 Context As shown by human annotation: Individual sentences often insufficient for classification Simplest solution: Compute features over a window of context sentences – Problem: context typically includes non-speech sentences “I am going to see his ghost!” Lorry quietly chafed the hands that held his arm.

19 Context
Our solution: A simple “direct speech” recognizer
– CRF-based sequence tagger (Mallet) trained on 1000 sentences
– Best results with 8 sentences of direct-speech context: +5% accuracy over no context
Example tagging (sentence context vs. speech context):
B-SP: “I am going to see his ghost!”
O: Lorry quietly chafed the hands that held his arm.
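Restricting context to direct speech can be sketched as below. A crude quotation-mark check stands in for the CRF tagger, which is an assumption made only to keep the example self-contained; the function names and the last three example sentences are invented.

```python
def is_direct_speech(sentence):
    # Stand-in for the CRF tagger (assumption): a sentence counts as
    # direct speech if it opens with a quotation mark.
    return sentence.lstrip().startswith(('"', "“"))


def speech_context(sentences, index, size=8):
    """Return up to `size` direct-speech sentences nearest to sentences[index]."""
    speech = [i for i, s in enumerate(sentences)
              if i != index and is_direct_speech(s)]
    nearest = sorted(speech, key=lambda i: abs(i - index))[:size]
    return [sentences[i] for i in sorted(nearest)]


sentences = [
    "“I am going to see his ghost!”",
    "Lorry quietly chafed the hands that held his arm.",
    "“Are you ready, sir?”",   # invented
    "He nodded.",              # invented
    "“Then let us go.”",       # invented
]
# Context for the third sentence: narration is skipped, speech is kept.
print(speech_context(sentences, 2, size=2))
```

Features for the target sentence are then computed over the returned sentences instead of over a fixed window that would include narration.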

20 Quantitative results
Model                      Accuracy
Frequency BL (V)           59.1
Lexical features           67.0
Semantic class features    57.5
Politeness features        59.6
Only lexical features yield significant improvement over frequency baseline
Goal 2 ✓ (Faruqui & Pado, 2011; 2012)

21 Qualitative analysis: Lexical features Top 10 lexical features

22 Conclusions Formal and informal language exists in English as well – Indicators more dispersed across context Bootstrapping a T/V classifier for English possible Results still fairly modest – Asymmetry: V more marked than T → better features – Difficult to operationalize features with high recall (sociolinguistic features, first names, …)

23 References
M. Faruqui & S. Pado. “I thou thee, thou traitor”: Predicting formal vs. informal address in English literature. ACL 2011.
M. Faruqui & S. Pado. Towards a model of formal and informal address in English. EACL 2012.
Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA.
Penelope Brown and Stephen C. Levinson. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.
Fabienne Braune & Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. COLING 2010.
Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.
Andrew Kachites McCallum. Mallet: A machine learning for language toolkit.

24 Unsupervised Learning Learning by finding patterns in data Clustering

25 Word clustering Why ? Feature reduction From words to word classes Generalization of unseen words Bangalore ~ Bengaluru Identification of words with similar meaning Word-sense disambiguation Reduces the need for tagged data

26 Word clustering How ? Distributional similarity How similar is the occurrence pattern of two words in a given corpus ? “You shall know a word by the company it keeps” – J. R. Firth Morphological similarity How similar are two words orthographically ? Madras ~ Chennai … NO Bangalore ~ Bengaluru … YES
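The orthographic comparison can be made concrete with a normalized edit distance. This is only one possible measure, chosen here for illustration; Clark's morphological model is a different (probabilistic) approach.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, 0.0 for completely different ones."""
    a, b = a.lower(), b.lower()
    return 1 - edit_distance(a, b) / max(len(a), len(b))


print(round(similarity("Bangalore", "Bengaluru"), 2))  # 0.67
print(round(similarity("Madras", "Chennai"), 2))       # 0.14
```

Bangalore and Bengaluru differ by three substitutions, so they come out as orthographically close, while Madras and Chennai share almost nothing at the character level even though they name the same city.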

27 Word clustering Language modeling approach 1. Ranjitha cooks Uttapam. 2. Ranjitha cooks Rava masala dosa. 3. Ranjitha cooks Facebook. How do you know which one is wrong ??

28 Word clustering Language modeling approach Maximize the probability of occurrence of a sequence of words S: Ranjitha cooks Facebook P(S) = P(Ranjitha) * P(cooks|Ranjitha) * P(Facebook|cooks) P(Facebook|cooks) will be very near zero OR zero !
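The bigram factorization above can be computed directly from counts. The sketch below uses unsmoothed maximum-likelihood estimates on a two-sentence toy corpus built from the slide's examples; real language models add smoothing so that unseen bigrams do not get exactly zero.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams in a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams


def sentence_prob(sentence, unigrams, bigrams):
    """P(S) = P(w1) * P(w2|w1) * ... with maximum-likelihood estimates."""
    total = sum(unigrams.values())
    prob = unigrams[sentence[0]] / total
    for prev, word in zip(sentence, sentence[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


corpus = [
    ["Ranjitha", "cooks", "Uttapam"],
    ["Ranjitha", "cooks", "Rava", "masala", "dosa"],
]
unigrams, bigrams = train_bigram(corpus)
print(sentence_prob(["Ranjitha", "cooks", "Uttapam"], unigrams, bigrams))   # 0.125
print(sentence_prob(["Ranjitha", "cooks", "Facebook"], unigrams, bigrams))  # 0.0
```

"Facebook" never follows "cooks" in the corpus, so the third factor is zero and the whole sentence gets probability zero.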

29 Word clustering
(Diagram: word nodes w_1 … w_4 and class nodes C_1 … C_4)
S: w_1 w_2 w_3 w_4
P(S) = P(C_1) * P(w_1|C_1) * P(C_2|C_1) * P(w_2|C_2) * … (Och, 1999)
This is called a Hidden Markov Model (HMM)
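The class-based factorization can be traced with a few hand-set probabilities. All numbers and class names below are invented for illustration, and each word is assumed to belong to exactly one class (a hard clustering); only the shape of the product follows the slide.

```python
# Toy model (all values invented); each word belongs to exactly one class.
word_class = {"Ranjitha": "PERSON", "cooks": "VERB", "Uttapam": "FOOD",
              "dosa": "FOOD", "Facebook": "COMPANY"}
p_first = {"PERSON": 0.5}                      # P(C_1)
p_trans = {("PERSON", "VERB"): 0.8,            # P(C_i | C_{i-1})
           ("VERB", "FOOD"): 0.6,
           ("VERB", "COMPANY"): 0.01}
p_emit = {("Ranjitha", "PERSON"): 0.1,         # P(w_i | C_i)
          ("cooks", "VERB"): 0.2,
          ("Uttapam", "FOOD"): 0.05,
          ("dosa", "FOOD"): 0.05,
          ("Facebook", "COMPANY"): 0.1}


def class_bigram_prob(sentence):
    """P(S) = P(C_1) P(w_1|C_1) * prod_i P(C_i|C_{i-1}) P(w_i|C_i)."""
    classes = [word_class[w] for w in sentence]
    prob = p_first.get(classes[0], 0.0)
    prob *= p_emit.get((sentence[0], classes[0]), 0.0)
    for i in range(1, len(sentence)):
        prob *= p_trans.get((classes[i - 1], classes[i]), 0.0)
        prob *= p_emit.get((sentence[i], classes[i]), 0.0)
    return prob


print(class_bigram_prob(["Ranjitha", "cooks", "dosa"]))
print(class_bigram_prob(["Ranjitha", "cooks", "Facebook"]))
```

Because transitions are estimated between classes rather than words, "cooks dosa" gets a reasonable probability even if that exact word pair was never observed, as long as some FOOD word followed some VERB word.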

30 Word clustering Adding morphology (Clark, 2003)
(Diagram: word nodes w_1 … w_4 and class nodes C_1 … C_4)
P(S) = P(C_1) * P(w_1|C_1) * P_m(w_1|C_1) * P(C_2|C_1) * P(w_2|C_2) * P_m(w_2|C_2) * …

31 Word clustering Implementation
Initialization of clusters
– Randomized
– Heuristic-based
Optimization algorithm
– Greedy, since no closed-form solution exists
– Transfer each word to the cluster that gives the highest improvement
Termination
– Until no more words are exchanged
– Until a specific number of words have been exchanged
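The greedy exchange procedure can be sketched as follows, assuming a hard class-bigram model whose log-likelihood is recomputed from scratch after every tentative move. That is far too slow for real corpora and is not Clark's software; the round-robin initialization, function names, and toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(sentences, cluster_of):
    """Log-likelihood of the corpus under a hard class-bigram model."""
    class_bigrams, class_left = Counter(), Counter()
    class_counts, word_counts = Counter(), Counter()
    for sent in sentences:
        for w in sent:
            word_counts[w] += 1
            class_counts[cluster_of[w]] += 1
        for prev, w in zip(sent, sent[1:]):
            class_bigrams[(cluster_of[prev], cluster_of[w])] += 1
            class_left[cluster_of[prev]] += 1
    ll = 0.0
    for (c1, _), n in class_bigrams.items():
        ll += n * math.log(n / class_left[c1])               # P(C_i | C_{i-1})
    for w, n in word_counts.items():
        ll += n * math.log(n / class_counts[cluster_of[w]])  # P(w_i | C_i)
    return ll


def initial_assignment(vocab, num_clusters):
    # Heuristic initialization (assumption): round-robin over sorted words.
    return {w: i % num_clusters for i, w in enumerate(sorted(vocab))}


def exchange_clustering(sentences, num_clusters, max_rounds=20):
    vocab = {w for s in sentences for w in s}
    cluster_of = initial_assignment(vocab, num_clusters)
    for _ in range(max_rounds):
        moved = 0
        for w in sorted(vocab):
            original = cluster_of[w]
            best, best_ll = original, log_likelihood(sentences, cluster_of)
            for c in range(num_clusters):
                if c == original:
                    continue
                cluster_of[w] = c  # tentative move
                ll = log_likelihood(sentences, cluster_of)
                if ll > best_ll + 1e-12:
                    best, best_ll = c, ll
            cluster_of[w] = best   # keep the best cluster for this word
            moved += best != original
        if moved == 0:             # terminate when no word is exchanged
            break
    return cluster_of


sentences = [["she", "cooks", "rice"], ["he", "cooks", "dosa"],
             ["she", "eats", "dosa"], ["he", "eats", "rice"]]
print(exchange_clustering(sentences, 2))
```

Each accepted move strictly increases the objective, so the loop cannot cycle; it stops at a local optimum that depends on the initialization.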

32 Word clustering Application / Evaluation Named Entity Recognition Identification and labeling of names of people, places, organization etc. Pre-processing task for many NLP applications Tags from the CoNLL-03 shared-task on NER: PERson, ORGanization, LOCation, MISCellaneous (Sonia Gandhi) PER is an (Italian) MISC who lives in (India) LOC.

33 Named Entity Recognition NER for German: Challenges
– Sparse data: Only one NE-tagged dataset (CoNLL 2003), just 0.2M tokens
– Complex morphology: Difficult lemmatization
– Common noun capitalization: no easy entity detection
Poor performance, in particular poor recall

34 Named Entity Recognition NER for German: Challenges
Language    Recall    Precision    F-Score
English     88.5%     89.0%        88.8%
German      63.7%     83.9%        72.4%
Recall is a problem! More training data can help, but it is expensive. Semantic generalization?
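The F-score is the harmonic mean of precision and recall, so a low recall drags it down even when precision is high. A quick check against the German figures:

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# German figures from the slide: precision 83.9%, recall 63.7%.
print(round(f_score(83.9, 63.7), 1))  # 72.4
```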

35 Named Entity Recognition Word clustering
Provides a route to semantic generalization. But how can it help?
Deutschland (70), Ostdeutschland (0), Westdeutschland (5): LOC
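One way to read this example: the numbers are presumably training-set frequencies, and a word that never occurs in the NER training data can still share a cluster feature with a frequent word. The sketch below shows such a feature; the cluster id 17 and the feature-name format are hypothetical and do not reflect the Stanford NER system's actual feature templates.

```python
# Hypothetical output of unsupervised clustering on untagged text.
word_to_cluster = {
    "Deutschland": 17,
    "Ostdeutschland": 17,
    "Westdeutschland": 17,
}


def token_features(token, word_to_cluster):
    """Features for one token: its identity plus its cluster id, if known."""
    feats = {"word=" + token}
    cluster = word_to_cluster.get(token)
    if cluster is not None:
        feats.add("cluster=%d" % cluster)
    return feats


seen = token_features("Deutschland", word_to_cluster)       # frequent in training
unseen = token_features("Ostdeutschland", word_to_cluster)  # absent from training
print(seen & unseen)  # {'cluster=17'}
```

A weight learned for the shared cluster feature from the frequent word transfers to the unseen one, which is how clustering can raise recall.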

36 Named Entity Recognition Experimental setup
– Cluster German words with Clark’s clustering software on the basis of an untagged generalization corpus: HGC, deWac (Baroni et al., 2009)
– Stanford’s CRF-based NER system (Finkel and Manning 2009)
– Training on an NER-tagged corpus (CoNLL 2003 German train set, newswire)
– Evaluate on CoNLL 2003 testb set (50M words, in-domain)

37 Named Entity Recognition Results (Faruqui & Pado, 2010)
System              Precision    Recall    F-Score
Florian et al.      […]%         63.7%     72.4%
Baseline (0/0)      84.5%        63.1%     72.3%
HGC (175m/600)      86.6%        71.2%     78.2%
deWac (175m/400)    86.4%        68.5%     76.4%

38 Multilingual word clustering
Clustering words from two languages together
– If parallel data in two languages is available
– Word alignments can give additional information
– Additional constraints may give better clustering
Example clusters: (I, You, We, They, She) and (Ich, Sie, Uns, Er)

39 Multilingual word clustering Language 1 Language 2

40 Multilingual word clustering Language 1 Language 2

41 Multilingual word clustering
Minimize the randomness of the clustering
Minimize the entropy of the clustering
If clustering of L_1 is represented by a random variable X
We want to minimize the entropy of one clustering given the other:
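A standard quantity that matches "the entropy of one clustering given the other" is the conditional entropy H(Y|X) = −Σ p(x, y) log p(y|x). The sketch below assumes that the clustering of the second language is a random variable Y and that every word-alignment edge contributes one (x, y) pair of cluster labels; both are assumptions, and the exact objective used in the talk may differ.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """H(Y|X) in bits, from (x, y) cluster-label pairs, one per alignment edge."""
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (x, _), count in joint.items():
        h -= (count / n) * math.log2(count / x_counts[x])
    return h


# Clusterings agree across languages: each L1 cluster maps to one L2 cluster.
consistent = [(0, "a"), (0, "a"), (1, "b"), (1, "b")]
# Clusterings disagree: each L1 cluster is split evenly over two L2 clusters.
mixed = [(0, "a"), (0, "b"), (1, "a"), (1, "b")]
print(conditional_entropy(consistent))  # 0.0
print(conditional_entropy(mixed))       # 1.0
```

Low conditional entropy means that knowing a word's cluster in one language almost determines the cluster of its aligned word in the other, which is the consistency the multilingual objective rewards.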

42 Multilingual word clustering We optimize both the monolingual and multilingual objective together: Further edge filtering heuristics can be used Words aligned with stop words generally noisy Low frequency words are important Finding out whether edge filtering is language dependent or not

43 References
M. Faruqui & S. Pado. Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. KONVENS 2010.
Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The wacky wide web: A collection of very large linguistically processed web-crawled corpora. JLRE, 43(3):209–226.
Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. Proc. EACL, pages 59–66, Budapest, Hungary.
Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. Proc. EMNLP, pages 141–150, Singapore.
Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named entity recognition through classifier combination. Proc. CoNLL, pages 168–171, Edmonton.
Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proc. CoNLL, pages 142–147, Edmonton, AL.

44 Thank you! Questions? Please write to: Or visit:

