2 Shallow Parsing Break text up into non-overlapping contiguous subsets of tokens. Also called chunking, partial parsing, light parsing. What is it useful for? Entity recognition – people, locations, organizations Studying linguistic patterns – gave NP – gave up NP in NP – gave NP NP – gave NP to NP Can ignore complex structure when not relevant
3 A Relationship between Segmenting and Labeling Tokenization segments the text Tagging labels the text Shallow parsing does both simultaneously.
4 Chunking vs. Full Syntactic Parsing “ G.K. Chesterton, author of The Man who was Thursday”
5 Representations for Chunks IOB tags Inside, outside, and begin Why do we need a begin tag?
6 Representations for Chunks Trees Chunk structure is a two-level tree that spans the entire text, containing both chunks and non-chunks
7 CONLL Collection From the Conference on Natural Language Learning Competition from 2000 Goal: create machine learning methods to improve on the chunking task
8 CONLL Collection Data in IOB format from WSJ: Word POS-tag IOB-tag Training set: 8936 sentences Test set: 2012 sentences Tags from the Brill tagger Penn Treebank Tags Evaluation measure: F-score 2*precision*recall / (recall+precision) Baseline was: select the chunk tag that is most frequently associated with the POS tag, F =77.07 Best score in the contest was F=94.13
9 nltk_lite and CONLL2000 Note that raw hides the IOB format
14 Chunking with Regular Expressions This time we write regex’s over TAGS rather than words ? + Compile them with parse.ChunkRule() rule = parse.ChunkRule(‘ +’) chunkparser = parse.RegexpChunk([rule], chunk_node = ‘NP’) Resulting object is of type Tree Top-level node called S Can change this label if you want, in third argument to RegexpChunk