Part-of-speech tagging and chunking with log-linear models University of Manchester Yoshimasa Tsuruoka.

1 Part-of-speech tagging and chunking with log-linear models University of Manchester Yoshimasa Tsuruoka

2 Outline
- POS tagging and chunking for English: Conditional Markov Models (CMMs), Dependency Networks, Bidirectional CMMs
- Maximum entropy learning
- Conditional Random Fields (CRFs)
- Domain adaptation of a tagger

3 Part-of-speech tagging
The tagger assigns a part-of-speech tag to each word in the sentence.
The  peri-kappa  B   site  mediates  human  immunodeficiency
DT   NN          NN  NN    VBZ       JJ     NN
virus  type  2   enhancer  activation  in  monocytes ...
NN     NN    CD  NN        NN          IN  NNS

4 Algorithms for part-of-speech tagging
Tagging speed and accuracy on WSJ:

Method                     Tagging speed  Accuracy
Dependency Net (2003)      Slow?          97.24
SVM (2004)                 Fast           97.16
Perceptron (2002)          ?              97.11
Bidirectional CMM (2005)   Fast           97.10
HMM (2000)                 Very fast      96.7*
CMM (1998)                 Fast           96.6*

* evaluated on a different portion of WSJ

5 Chunking (shallow parsing)
A chunker (shallow parser) segments a sentence into non-recursive phrases.
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only # 1.8 billion]NP [in]PP [September]NP .

6 Chunking (shallow parsing)
Chunking tasks can be converted into a standard tagging task.
Different approaches: sliding window, semi-Markov CRF, ...
He    reckons  the   current  account  deficit  will  narrow  to
B-NP  B-VP     B-NP  I-NP     I-NP     I-NP     B-VP  I-VP    B-PP
only  #     1.8   billion  in    September
B-NP  I-NP  I-NP  I-NP     B-PP  B-NP
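The conversion above can be sketched in a few lines of Python (an illustrative sketch, not code from the talk; the span representation is assumed, and the B/I tags follow the slide's example):

```python
# Convert phrase chunks into per-token B/I/O tags so that chunking
# becomes a standard tagging task.

def chunks_to_bio(tokens, chunks):
    """chunks: list of (start, end_exclusive, label) spans over tokens."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # inside the chunk
    return tags

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
print(chunks_to_bio(tokens, chunks))
# ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP']
```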

7 Algorithms for chunking
Chunking speed and accuracy on Penn Treebank:

Method                     Speed  Accuracy
SVM + voting (2001)        Slow?  93.91
Perceptron (2003)          ?      93.74
Bidirectional CMM (2005)   Fast   93.70
SVM (2000)                 Fast   93.48

8 Conditional Markov Models (CMMs)
Left-to-right decomposition (with the first-order Markov assumption):
P(t_1 ... t_n | o) = prod_i P(t_i | t_{i-1}, o)
where o is the observed word sequence and t_1 ... t_n are the tags.

9 POS tagging with CMMs [Ratnaparkhi 1996; etc.]
Left-to-right decomposition: the local classifier uses the information on the preceding tag.
He   runs  fast
PRP  VBZ   RB

10 Examples of the features for local classification

Word unigram      w_i, w_{i-1}, w_{i+1}
Word bigram       w_{i-1} w_i, w_i w_{i+1}
Previous tag      t_{i-1}
Tag/word          t_{i-1} w_i
Prefix/suffix     up to length 10
Lexical features  hyphen, number, etc.
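A toy extractor for these templates (a sketch with hypothetical feature-name strings; only the template types are taken from the table above):

```python
# Extract the local-classification features for word i, given the tag
# predicted for the preceding word.

def local_features(words, i, prev_tag):
    w = words[i]
    prev_w = words[i - 1] if i > 0 else "<BOS>"
    next_w = words[i + 1] if i + 1 < len(words) else "<EOS>"
    feats = [
        "w0=" + w, "w-1=" + prev_w, "w+1=" + next_w,            # word unigrams
        "w-1,w0=" + prev_w + "/" + w,                           # word bigrams
        "w0,w+1=" + w + "/" + next_w,
        "t-1=" + prev_tag,                                      # previous tag
        "t-1,w0=" + prev_tag + "/" + w,                         # tag/word
    ]
    for k in range(1, min(len(w), 10) + 1):                     # prefix/suffix
        feats.append("prefix=" + w[:k])
        feats.append("suffix=" + w[-k:])
    if "-" in w:
        feats.append("contains-hyphen")                         # lexical features
    if any(c.isdigit() for c in w):
        feats.append("contains-number")
    return feats

print(local_features(["He", "runs", "fast"], 1, "PRP")[:3])
# ['w0=runs', 'w-1=He', 'w+1=fast']
```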

11 POS tagging with a Dependency Network [Toutanova et al. 2003]
Uses the information on the succeeding tag as well. The product of local scores is no longer a probability, but the succeeding tag can be used as a feature in the local classification model.

12 POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003]
Training cost is small, almost equal to CMMs. Decoding can be performed with dynamic programming, but it is still expensive. Collusion: the model can lock onto conditionally consistent but jointly unlikely sequences.

13 Bidirectional CMMs [Tsuruoka and Tsujii, 2005]
Possible decomposition structures (a)-(d) differ in the directions in which t_1, t_2, t_3 condition on each other.
With bidirectional CMMs we can find the "best" structure and tag sequence in polynomial time.

14 Bidirectional CMMs
Another way of decomposition: the local classifier has the information about the tags on both sides when tagging the second word.
He   runs  fast
PRP  VBZ   RB

15 Outline
- POS tagging and chunking for English: Conditional Markov Models (CMMs), Dependency Networks, Bidirectional CMMs
- Maximum entropy learning
- Conditional Random Fields (CRFs)
- Domain adaptation of a tagger

16 Maximum entropy learning
Log-linear modeling:
p(y | x) = (1 / Z(x)) exp( sum_i lambda_i f_i(x, y) ),  Z(x) = sum_{y'} exp( sum_i lambda_i f_i(x, y') )
where f_i(x, y) is a feature function and lambda_i the corresponding feature weight.
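The model above can be sketched directly in Python (an illustrative sketch, not code from the talk; the weights dict, `feature_fn`, and the toy word/suffix features are assumptions):

```python
import math

# p(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z(x), with binary features
# represented as strings and lambda_i stored in a dict.

def loglinear_prob(weights, feature_fn, x, y, classes):
    def score(c):
        return sum(weights.get(f, 0.0) for f in feature_fn(x, c))
    z = sum(math.exp(score(c)) for c in classes)    # Z(x)
    return math.exp(score(y)) / z

# Hypothetical features: the word paired with the tag, and its suffix.
def feature_fn(word, tag):
    return ["word=%s/%s" % (word, tag), "suffix=%s/%s" % (word[-1], tag)]

weights = {"word=runs/Verb": 1.0, "suffix=s/Verb": 0.5}
p = loglinear_prob(weights, feature_fn, "runs", "Verb", ["Noun", "Verb"])
print(round(p, 3))  # 0.818
```

With all weights zero the model reduces to the uniform distribution, which is the maximum-entropy solution when no feature constraints are active.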

17 Maximum entropy learning
Maximum likelihood estimation: find the parameters that maximize the (log-)likelihood of the training data.
Regularization:
- Gaussian prior [Berger et al., 1996]
- Inequality constraints [Kazama and Tsujii, 2005]

18 Parameter estimation
Algorithms for maximum entropy: GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997].
General-purpose algorithms for numerical optimization: BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001].
You need to provide the objective function and its gradient:
- the likelihood of the training samples
- the model expectation of each feature
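These two quantities can be computed in one pass; the gradient of the log-likelihood with respect to each weight is the observed feature count minus the model expectation (minus lambda/sigma^2 when a Gaussian prior is used). A minimal sketch, assuming binary features and the dict-based representation of the previous slide:

```python
import math

def loglik_and_grad(weights, data, feature_fn, classes, sigma2=None):
    """Return (log-likelihood, gradient dict) over (x, y) training pairs."""
    ll = 0.0
    grad = dict.fromkeys(weights, 0.0)
    for x, y in data:
        scores = {c: sum(weights.get(f, 0.0) for f in feature_fn(x, c))
                  for c in classes}
        z = sum(math.exp(s) for s in scores.values())
        ll += scores[y] - math.log(z)
        for f in feature_fn(x, y):                    # observed counts
            grad[f] = grad.get(f, 0.0) + 1.0
        for c in classes:                             # model expectations
            p = math.exp(scores[c]) / z
            for f in feature_fn(x, c):
                grad[f] = grad.get(f, 0.0) - p
    if sigma2 is not None:                            # Gaussian prior penalty
        for f in weights:
            ll -= weights[f] ** 2 / (2 * sigma2)
            grad[f] -= weights[f] / sigma2
    return ll, grad
```

An off-the-shelf optimizer such as BFGS can then be driven by this function; with zero weights, one sample, and two classes, the gradient of the gold class's feature is 1 - 0.5 = 0.5.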

19 Computing likelihood and model expectation
Example:
- two possible tags, "Noun" and "Verb"
- two types of features, "word" and "suffix"
(Worked on the slide for the sentence "He opened it", comparing tag = Noun vs. tag = Verb.)

20 Conditional Random Fields (CRFs)
A single log-linear model over the whole sentence. One can use exactly the same techniques as in maximum entropy learning to estimate the parameters. However, the number of classes (all possible tag sequences) is huge, so naive computation is infeasible in practice.

21 Conditional Random Fields (CRFs)
Solution: restrict the types of features; then a dynamic programming algorithm drastically reduces the amount of computation.
Features you can use (in first-order CRFs):
- features defined on a single tag
- features defined on an adjacent pair of tags

22 Features
Feature weights are associated with states and edges of the tag lattice (Noun/Verb at each position of "He has opened it").
Example state feature: w_0 = He & tag = Noun
Example edge feature: tag_left = Noun & tag_right = Noun

23 A naive way of calculating Z(x)
Enumerate all 2^4 = 16 Noun/Verb sequences for "He has opened it", compute the exponentiated score of each path (7.2, 1.3, 4.5, 0.9, 2.3, 11.2, 3.4, 2.5, 4.1, 0.8, 9.7, 5.5, 5.7, 4.3, 2.2, 1.9), and sum them: Z(x) = 67.5.
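The brute-force computation is trivial to write down (a sketch; `score_fn` stands in for the sum of state and edge weights along a path), which makes its exponential cost easy to see:

```python
import itertools
import math

# Z(x) by enumeration: |tags|^n sequences, so this is only feasible for
# toy inputs -- exactly the problem dynamic programming solves.

def naive_Z(n_words, tags, score_fn):
    return sum(math.exp(score_fn(seq))
               for seq in itertools.product(tags, repeat=n_words))

# With all path scores zero, each of the 2^4 = 16 sequences contributes
# exp(0) = 1:
print(naive_Z(4, ["Noun", "Verb"], lambda seq: 0.0))  # 16.0
```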

24 Dynamic programming
Results of intermediate computation can be reused (forward pass over the Noun/Verb lattice for "He has opened it").

25 Dynamic programming
Results of intermediate computation can be reused (backward pass over the same lattice).

26 Dynamic programming
Computing the marginal distribution of each tag by combining the forward and backward scores.
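The forward-backward recursions of the last three slides can be sketched as follows (illustrative code, not from the talk; `state_score` and `trans_score` are assumed log-potential functions, and this version works in probability space rather than log space, so it is for toy inputs only):

```python
import math

def forward_backward(n, tags, state_score, trans_score):
    """Return (marginals, Z): marginals[i][t] = p(tag_i = t | x)."""
    # Forward pass: alpha[i][t] sums over all paths ending in t at position i.
    alpha = [{t: math.exp(state_score(0, t)) for t in tags}]
    for i in range(1, n):
        alpha.append({
            t: math.exp(state_score(i, t)) *
               sum(alpha[i - 1][s] * math.exp(trans_score(s, t)) for s in tags)
            for t in tags})
    # Backward pass: beta[i][t] sums over all paths leaving t at position i.
    beta = [None] * n
    beta[n - 1] = {t: 1.0 for t in tags}
    for i in range(n - 2, -1, -1):
        beta[i] = {
            t: sum(math.exp(trans_score(t, s)) *
                   math.exp(state_score(i + 1, s)) * beta[i + 1][s]
                   for s in tags)
            for t in tags}
    Z = sum(alpha[n - 1][t] for t in tags)
    marginals = [{t: alpha[i][t] * beta[i][t] / Z for t in tags}
                 for i in range(n)]
    return marginals, Z

marg, Z = forward_backward(4, ["Noun", "Verb"], lambda i, t: 0.0, lambda a, b: 0.0)
print(Z)  # 16.0 -- matches enumerating all 2^4 paths, in O(n * |tags|^2) time
```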

27 Maximum entropy learning and Conditional Random Fields
Maximum entropy learning:
- log-linear modeling + MLE
- parameter estimation needs the likelihood of each sample and the model expectation of each feature
Conditional Random Fields:
- log-linear modeling over the whole sentence
- features defined on states and edges
- dynamic programming

28 Named Entity Recognition
We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8- murine T lymphocyte precursors]cell_line.
A term consists of multiple tokens, so we want to define features on a term rather than on a token: semi-Markov CRFs [Sarawagi 2004].

29 Algorithms for Biomedical Named Entity Recognition
Recall, precision, and F-score compared on the shared task data for the COLING 2004 BioNLP workshop:
- SVM+HMM (2004)
- Semi-Markov CRF [Okanohara et al., 2006]
- Sliding window
- MEMM (2004)
- CRF (2004)

30 Outline
- POS tagging and chunking for English: Conditional Markov Models (CMMs), Dependency Networks, Bidirectional CMMs
- Maximum entropy learning
- Conditional Random Fields (CRFs)
- Domain adaptation of a tagger

31 Domain adaptation
- Large training data sets are available for general domains (e.g. Penn Treebank WSJ).
- NLP tools trained with general-domain data are less accurate on biomedical domains.
- Development of domain-specific data requires considerable human effort.

32 Tagging errors made by a tagger trained on WSJ
Accuracy of the tagger on the GENIA POS corpus: 84.4%
... and  membrane  potential  after  mitogen  binding
    CC   NN        NN         IN     NN       JJ
... two  factors, which  bind  to  the  same  kappa  B   enhancers ...
    CD   NNS      WDT    NN    TO  DT   JJ    NN     NN  NNS
... by  analysing  the  Ag   amino  acid  sequence
    IN  VBG        DT   VBG  JJ     NN    NN
... to  contain  more  T-cell  determinants  than ...
    TO  VB       RBR   JJ      NNS           IN
Stimulation  of  interferon  beta  gene  transcription  in  vitro  by
NN           IN  JJ          JJ    NN    NN             IN  NN     IN

33 Re-training of maximum entropy models
The taggers are trained as maximum entropy models, with feature functions given by the developer and feature weights as the model parameters. The models are adapted to target domains by re-training with domain-specific data.

34 Methods for domain adaptation
- Combined training data: a model is trained from scratch with the original and domain-specific data.
- Reference distribution: the original model is used as the reference probability distribution of a domain-specific model.
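The reference-distribution idea can be sketched as follows (illustrative code, not from the talk: the adapted model multiplies the original-domain model q0(y|x) by a new log-linear factor trained on the domain-specific data, then renormalizes; all names and the toy q0 are assumptions):

```python
import math

def adapted_prob(weights, feature_fn, q0, x, y, classes):
    """p(y|x) proportional to q0(y|x) * exp(sum_i lambda_i f_i(x, y))."""
    def unnorm(c):
        s = sum(weights.get(f, 0.0) for f in feature_fn(x, c))
        return q0(x, c) * math.exp(s)
    z = sum(unnorm(c) for c in classes)
    return unnorm(y) / z

# With no new weights, the adapted model falls back to the original model q0,
# so unseen target-domain phenomena default to the general-domain behavior.
q0 = lambda x, c: {"NN": 0.8, "VB": 0.2}[c]
print(adapted_prob({}, lambda x, c: [c], q0, "binding", "NN", ["NN", "VB"]))  # 0.8
```

Only the new domain-specific weights need to be estimated, which is why this is cheaper than re-training a combined model from scratch.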

35 Adaptation of the part-of-speech tagger
Relationships among training and test data are evaluated for the following corpora:
- WSJ: Penn Treebank WSJ
- GENIA: GENIA POS corpus [Kim et al., 2003]; 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors
- PennBioIE: Penn BioIE corpus [Kulick et al., 2004]; 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes and 1,157 MEDLINE abstracts about molecular genetics of cancer
- Fly: 200 MEDLINE abstracts on Drosophila melanogaster

36 Training and test sets

Training sets  # tokens  # sentences
WSJ            912,344   38,219
GENIA          450,492   18,508
PennBioIE      641,838   29,422
Fly            1,024

Test sets      # tokens  # sentences
WSJ            129,654   5,462
GENIA          50,562    2,036
PennBioIE      70,713    3,270
Fly            7,615     326

37 Experimental results
Accuracy on the WSJ, GENIA, PennBioIE, and Fly test sets, and training time (sec.), for four settings: a model trained on WSJ+GENIA+PennBioIE, a model trained on Fly only (accuracy 93.91), the combined-data model, and the reference-distribution model.

38 Corpus size vs. accuracy (combined training data)

39 Corpus size vs. accuracy (reference distribution)

40 Summary
POS tagging: MEMM-like approaches achieve good performance with reasonable computational cost; CRFs seem to be too computationally expensive at present.
Chunking: CRFs yield good performance for NP chunking; semi-Markov CRFs are promising, but we need to somehow reduce their computational cost.
Domain adaptation: one can easily use the information about the original domain as the reference distribution.

41 References
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics.
Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP.
Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP.
Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL.
John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML.
Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP.
Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.

42 References
Xavier Carreras and Lluís Márquez. (2003). Phrase recognition by filtering and ranking with perceptrons. Proceedings of RANLP.
Jesús Giménez and Lluís Márquez. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. Proceedings of LREC.
Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov Conditional Random Fields for Information Extraction. Proceedings of NIPS.
Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP.
Yuka Tateisi, Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2006). Subdomain adaptation of a POS tagger with a small corpus. Proceedings of the HLT-NAACL BioNLP Workshop.
Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL 2006.

