Presentation on theme: "Segmentation in NLP: A quick survey of techniques and applications Tony Davis (3M Health Information Systems) DC NLP meetup 13 August 2014."— Presentation transcript:
Segmentation in NLP: A quick survey of techniques and applications Tony Davis (3M Health Information Systems) DC NLP meetup 13 August 2014
What is segmentation? Dividing a document, broadly defined as a sequential information-bearing structure, into parts that have some internal coherence – Well known examples: book chapters, episodes of TV miniseries (segments are contiguous, strictly ordered, disjoint, and exhaustive) – More examples: sections and subsections of a textbook, topically-based segments defined by an index, plays in a sports broadcast (these depart from the prototype; they’re hierarchical, overlapping, “fuzzy”) – Notice that these examples are more or less “preset”, at the time the document is created, but ad hoc segments can be useful, too – Labeling the segments with good terms is an additional aspect of segmentation Examples: chapter titles, section headers in a clinical report, descriptions of plays in a football game
Why is segmentation useful? Segmentation guides and improves the user experience Segmentation affects meaning
What are we trying to do with automatic segmentation? Create “good”/”useful” segments, label them appropriately, and relate them to one another What makes a good segment? – Highly dependent on type of content (baseball game, cooking show, feature-length movie, talk show) – Thus, the mix of techniques for automatic segmentation will vary Relating segments to one another? – Topical similarity: the same people speaking about the same things, similar players or similar kinds of plays, common themes, … – Continuity of narrative (subplots in a TV comedy series, continuation of a project on home repair show)
Topic-based segmentation Topic segmentation: dividing a document into topically coherent segments – Possibly a partition of the document (exhaustive, non-overlapping segments) – But ad hoc segments may be more useful in some circumstances – Labeling the segments with good terms is often treated separately Several approaches used: – Discourse features Some signal a topic shift; others a continuation Highly domain-specific – Similarity measures between adjacent blocks of text Use document similarity measures, as in TextTiling (Hearst, 1994) or Choi’s algorithm (Choi, 2000) Posit boundaries at points where similarity is low – Lexical chains: repeated occurrences of a term (or of closely related terms) Again, posit boundaries where cohesion is low (few lexical chains cross the boundary (e.g., Galley, et al., 2003)
Topic segmentation – text similarity Choi’s algorithm (Choi, 2000) – Construct a similarity matrix of all sentence pairs – “Sharpen” the distinctions by replacing similarity values with their ranks in the local region – Coherent regions appear as brighter square areas along the diagonal; segment boundaries lie at their corners Freddy Y.Y. Choi. 2000. Advances in domain independent linear text segmentation. Proceedings of NAACL 2000, 109-117.
Topic segmentation – lexical chains LCSeg (Galley, et al., 2003) – Based on lexical chains: repeated occurrences of a term in a document – Each chain receives a score based on the number of terms within it, and its length (shorter chains get a higher score) – Cohesion at each potential boundary is measured by comparing scores of chains spanning that boundary and chains on either side near that boundary – Segments are postulated where cohesion is low Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse Segmentation of Multi-Party Conversation. Proceedings of the 41st Annual Meeting of the ACL, 562-569
Topic segmentation – modeling term influence and relatedness Model both the influence of a term beyond the sentence it occurs in and semantic relatedness among terms – The range of a term’s influence extends beyond the sentence it occurs in, but how far? (relevance intervals) – Semantic relatedness among terms (contextually mediated graphs) Apply this model to topic-based segmentation And now, a short diversion about streaming media… Geetu Ambwani and Anthony R. Davis. 2010. Contextually–Mediated Semantic Similarity Graphs for Topic Segmentation. Proceedings of the Workshop on Graph=based Methods in NLP, 48th Annual Meeting of the ACL.
Automatic segmentation of streaming media Streaming media presents particular difficulties – Its timed nature makes navigation cumbersome, so a useful system should extract relevant intervals of documents rather than present a list of documents to the user – Speech recognition is error-prone, so the system must compensate for noisy input To address these issues, use a mix of statistical and symbolic NLP – Create relevant intervals for each useful term, using term co-occurrence information (the “COW model”) – Term co-occurrence measured using pointwise mutual information (PMI) within a corpus PMI(X,Y) = P(X,Y)/P(X)P(Y)
“Original” StreamSage system architecture media archive co-occurring word (COW) model large text corpus speech recognition SR output topic segmentation WSD anaphor resolution query expansion relevance interval calculation virtual document collection compound interval calculation user interfaces for applications
Relevance Intervals (ad hoc segments) Each RI is a contiguous segment of audio/video deemed relevant to a term RIs are calculated for all content words (after lemmatization) and multi-word expressions RI basis: sentence containing the term Each RI is expanded forward and backward to capture relevant material, using the techniques including: 1.Topic boundary detection by changes in COW values across sentences 2.Topic boundary detection via discourse markers 3.Synonym-based query expansion 4.Anaphor resolution Nearby RIs for the same term are merged Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI, and the COW values of other words in the RI with the term
Relevance Intervals: an Example Index term: squatter Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example Index term: squatter Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary] We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example Index term: squatter Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary] In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals] NPR’s Kenneth Walker has a report. [merge nearby intervals] Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary] Two occurrences of squatter produce a complete merged interval.
Merging Ris for multiple terms Iran Russia Occurrence of Original Term Note that this can only be done at query time, So it needs to be fairly quick and simple.
Merging RIs Iran Russia Activation Spreading Occurrence of Original Term
Merging RIs Iran Russia Russia and Iran Occurrence of Original Term
Evaluating RI Quality Evaluation of retrieval effectiveness in timed media raises further issues: – Building a gold-standard is painstaking, and potentially quite subjective – It’s necessary to measure how closely the system’s RIs match the gold standard’s – What’s a reasonable baseline? We created a gold standard of about 2300 RIs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI on a scale of 1 (highly relevant) to 3 (marginally relevant). Testing performed on speech recognizer output
Evaluating RI Quality Measure amounts of extraneous and missed content missed by system ideal RI system RI extraneous In system
Evaluating RI Quality Comparison of percentages of median extraneous and missed content over all queries between system using COWs and system using only sentences with query terms present relevancemost relevant relevantmost relevant and relevant marginally relevant system extramissextramissextramissextramiss with COWs 9.312.739.227.529.921.661.715.0 without COWs 0.057.70.364.70.063.418.860.2
How to spot shifts in topic Yesterday, I took my dog to the park. While there, I took him off the leash to get some exercise. After 2 minutes, Spot began chasing a squirrel. ______________(Topic Shift)________________ I realized I needed to go grocery shopping. So I went off to Trader Joe’s. Unfortunately, they were out of cashews.
Calculating connection strengths for edges Construct a graph in which each node represents a term and a sentence, iff the sentence is contained in an RI for that term
Calculating connection strengths for edges Label each sentence with the terms that have Ris including that sentence; think of these as nodes in a graph
Calculating connection strengths for edges Now we’ll connect these nodes to one another, with edges that have weights
A real example (from CNN) S_190 We've got to get this addressed and hold down health care costs. S_191 Senator Ron Wyden the optimist from Oregon, we appreciate your time tonight. S_192 Thank you. ______________(Topic Shift)______________ S_193 Coming up, the final day of free health clinic in Kansas City, Missouri.
Graph representation of documents After pruning low-weight edges, regions near a boundary tend to be sparse
Data and parameter settings Two data sets – Concatenated New York Times articles 186 pseudodocuments, each containing 20 articles – Closed-captions for 13 TV shows News, talk shows, documentaries, lifestyle shows “Noisier” than news articles Parameter settings – Edge-pruning threshold: remove all edges with a connection strength below a set threshold (we tried a couple and usually used 0.5) – “Normalized novelty”: on the two sides of a potential boundary, the number of nodes labeled with the same terms, normalized by the total number of terms – Threshold (given normalized novelty): posit a boundary between sentences where normalized novelty falls below a set threshold (we typically used 0.6)
Systems compared C99Implementation of Choi (2000), from MorphAdorner* SNOur system, using a single node for each term occurrence (no extension) FEOur system, using an extension of a fixed number of sentences for each term from the sentence it occurs in SSOur system, using Ris without “hard” boundaries determined by the modified Choi algorithm SS+COur full segmentation system, incorporating “hard” boundaries determined by the modified Choi algorithm * morphadorner.northwestern.edu/morphadorner/-textsegmenter
Evaluation metrics How well does the hypothesized set of boundaries match the true (reference) set? P k (Beeferman, et al. 1997) and WindowDiff (Pevzner & Hearst, 2002) – Both compare hypothesis to reference segmentation within a sliding window – P k is the proportion of windows in which hypothesis and reference disagree on the number of boundaries – WindowDiff tallies the difference in the number of boundaries in each window – Both commonly used instead of precision and recall, because they take approximate matching into account – They have drawbacks of their own, however Doug Beeferman, Adam Berger, and John Lafferty. 1997. Text Segmentation Using Exponential Models. Proceedings of EMNLP 2 Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28:1
Evaluation metrics P k and WindowDiff in action: comparing two segmentations
Evaluation metrics P k and WindowDiff: sliding window is half the average reference segment size
Evaluation metrics P k and WindowDiff in action: window slides one sentence forward
Evaluation metrics P k and WindowDiff in action: still no disagreement between the two segmentations within the window
Evaluation metrics P k and WindowDiff in action: one black mark against the hypothesis segmentation, which posits a boundary where the reference doesn’t
Evaluation metrics P k and WindowDiff in action: the same mistake counts again against the hypothesis (note that mistakes closer to reference boundaries appear in fewer windows, and are therefore penalized less)
Results on pseudodocuments systemprecisionrecallFPkWindowDiff C99 0.4040.5690.4670.3380.360 SN 0.0960.1120.0990.5700.702 FE 0.2650.1400.1760.4780.536 SS 0.5660.3830.4480.2920.317 SS+C 0.5780.5350.5370.2620.283 185 documents, containing 20 NYT articles each Number of boundaries not specified to systems
Results on pseudodocuments 86 documents out of the 186, for which C99 and SS+C both find at least 20 boundaries; top 20 boundaries selected systemprec = rec = FPkWindowDiff C99 0.5300.2220.231 SS+C 0.6430.1920.201
TV show closed-captions: inter- annotator agreement on segmentation Annotators note major and minor boundaries, using 1-5 rating scale As expected, IAA is low, so we create a reference annotation
TV show closed-captions: inter- annotator agreement on segmentation P k values between pairs of annotators: all boundaries and major boundaries
TV show closed-captions: segmentation Accuracy is low, which is unsurprising given the low IAA systemprecisionrecallFPkWindowDiff All topic boundaries C99 0.1970.1860.1840.4760.507 SS+C 0.3150.2080.2400.4210.462 Major topic boundaries only C99 0.1700.2960.2010.6370.812 SS+C 0.2710.3160.2710.4630.621
Using segment-level metadata for sports broadcast segmentation Play-by-play data for sports – Available for several sports in near real-time – Particularly useful for sports that are inherently segmented into discrete plays – Advantages of consistency and quality Many other types of segment level metadata, depending on type of content – DVD chapters – Advertising breaks in TV shows – Screenplays
Segment-level metadata for sports Closed captions can be problematic – Sometimes describe clearly what’s going on… >> Kenny: THEY START FROM THEIR OWN 35. AND THIS IS GRANT WHO FUMBLED MOMENTS AGO. HAD A TERRIFIC HALF OF THE SEASON Here is data for this play from Stats, Inc. (near real-time):
Segment-level metadata for sports Alignment of this type of metadata with the video is fairly easily achieved, because the game clock is displayed on the video Stats, Inc. data for play, with added video times: SEA 714:33 Q1 GB 0DN 1 YD 10
Segmenting clinical documents Clinical documents are divided into sections, but they can be messy, and the section boundaries aren’t always clear Segmentation affects semantics; family medical history is different from patient’s current diagnosis Medical coding (and ultimately reimbursement) depends on accurately discerning the structure of clinical reports Thus, automated medical coding requires, among other things, good segmentation (“regioning”)
Machine learning for clinical document segmentation Features used include: – Common section header terms – Formatting and punctuation (e.g., all caps, colon at end of line) – Semantic type of terms within a sentence (e.g., body parts, medications, procedures, findings) For each sentence, generate a label: – new_region or continuing_region – Region use labels can be hypothesized at the same time, or added later Results are close to human levels of accuracy on many kinds of documents
Summing up Segmentation problems are ubiquitous in NLP Motivations are both user and machine driven Many kinds of information to exploit, depending on the data and the application Resources available to get started