A Statistical Model for Domain-Independent Text Segmentation. Masao Utiyama and Hitoshi Isahara. Presentation by Matthew Waymost.


1 A Statistical Model for Domain-Independent Text Segmentation. Masao Utiyama and Hitoshi Isahara. Presentation by Matthew Waymost.

2 Introduction The algorithm finds the maximum-probability segmentation of a text using a statistical method. No training is required. The method is domain-independent.

3 Other Methods Lexical cohesion. Statistical: hidden Markov model (Yamron et al., 1998).

4 Statistical Model Find the probability Pr( S | W ) of a segmentation S given a text W. Use Bayes' rule to find the maximum-probability segmentation.
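The Bayes' rule step can be written out as follows. The notation \hat{S} for the selected segmentation is an assumption introduced here; the denominator Pr( W ) is dropped because it does not depend on S.

$$
\hat{S} \;=\; \arg\max_{S} \Pr(S \mid W) \;=\; \arg\max_{S} \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W)} \;=\; \arg\max_{S} \Pr(W \mid S)\,\Pr(S)
$$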

5 Definition of Pr( W | S ) Assume that topics are statistically independent of each other and that words are independent within the scope of a topic. Assume different topics have different word distributions. Pr( W | S ) can then be broken down into a double product of word probabilities, taken over the segments and over the words within each segment. A Laplace estimator is used to estimate word probabilities from word frequencies.
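A sketch of the double product and the Laplace estimate described above. The symbols are assumptions chosen for illustration: m is the number of segments, n_i the number of words in segment S_i, w_j^i the j-th word of segment i, f_i(w) the count of w within segment i, and k the number of distinct words in W.

$$
\Pr(W \mid S) \;=\; \prod_{i=1}^{m} \prod_{j=1}^{n_i} \Pr\!\left(w_j^i \mid S_i\right),
\qquad
\Pr\!\left(w_j^i \mid S_i\right) \;=\; \frac{f_i\!\left(w_j^i\right) + 1}{n_i + k}
$$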

6 Definition of Pr( S ) The prior Pr( S ) varies depending on the prior information available; in general, assume none. It keeps the algorithm from generating too many segments, counteracting Pr( W | S ), which on its own tends to favor finer segmentations.
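One simple prior that behaves as described, shown here as an assumption rather than as the only possible choice, charges a fixed cost per segment. With n the total number of words in W and m the number of segments:

$$
\Pr(S) \;=\; n^{-m}
\qquad\Longrightarrow\qquad
-\log \Pr(S) \;=\; m \log n
$$

Each additional segment then adds log n to the cost, so a new boundary is only worthwhile if it improves -log Pr( W | S ) by more than that amount.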

7 Algorithm Convert the probability function into a cost function by taking the negative log. Given a text W, define g_i to be the gap between words w_i and w_{i+1}. Create a directed graph in which the nodes are the gaps between words and each edge represents the segment lying between the two gaps it connects. Calculate all edge weights using the cost function and find the minimum-cost path from the first node to the last.

8 Algorithm The calculated path represents the minimum-cost segmentation: each edge on the path corresponds to one segment.
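The graph search above can be sketched in Python as a dynamic program over gaps. This is a minimal illustration, not the authors' implementation: the function names `segment_cost` and `segment` are hypothetical, the edge cost assumes the Laplace estimate for word probabilities, and the per-segment penalty of log n is one assumed choice of prior. Edge weights are recomputed naively, so it is only suitable for short inputs.

```python
import math
from collections import Counter


def segment_cost(words, vocab_size, total_len):
    """Cost of one edge: the negative log probability of `words` as a single segment.

    Assumes Laplace-smoothed word probabilities (count + 1) / (n_i + k)
    and an assumed prior penalty of log(total_len) per segment.
    """
    counts = Counter(words)
    n_i = len(words)
    cost = sum(math.log((n_i + vocab_size) / (counts[w] + 1)) for w in words)
    return cost + math.log(total_len)


def segment(words):
    """Return the end index (exclusive) of each segment on the minimum-cost path."""
    n = len(words)
    k = len(set(words))
    # best[j] is the minimum cost of any path from gap 0 to gap j.
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + segment_cost(words[i:j], k, n)
            if cost < best[j]:
                best[j] = cost
                back[j] = i
    # Walk the back-pointers from the last gap to recover the boundaries.
    bounds = []
    j = n
    while j > 0:
        bounds.append(j)
        j = back[j]
    bounds.reverse()
    return bounds


text = ["a"] * 10 + ["b"] * 10
print(segment(text))
```

Because the graph only has edges from earlier gaps to later ones, processing gaps in left-to-right order is enough to find the shortest path; no general-purpose shortest-path routine is needed.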

9 Algorithm – Features The algorithm determines the number of segments itself, but the user can instead fix it by specifying the number of edges in the shortest path. The user can also restrict where segment boundaries occur by using only the subset of edges whose two end nodes both meet user-specified conditions. The algorithm is insensitive to text length, which is good for summarization.

10 Algorithm – Evaluation The algorithm was compared against C99 (Choi, 2000). An artificial test corpus extracted from the Brown corpus was used. A probabilistic error metric was used to evaluate performance. The results of the Utiyama algorithm were significantly better than those of the Choi algorithm at the 1% level.
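The slide does not name the probabilistic error metric. A common choice in the segmentation literature is P_k (Beeferman et al., 1999), and the sketch below assumes that is what is meant. The function name `pk` and the representation of a segmentation as a list of segment end indices are hypothetical; the window size defaults to half the average reference segment length, which is the usual convention.

```python
def pk(ref_bounds, hyp_bounds, k=None):
    """Fraction of position pairs (i, i + k) on which reference and hypothesis disagree.

    Each argument is a list of segment end indices (exclusive), for example
    [10, 20] for two segments of ten units each. Both must cover the same length.
    """
    def labels(bounds):
        # Map every unit to the index of the segment that contains it.
        out, start = [], 0
        for seg, end in enumerate(bounds):
            out.extend([seg] * (end - start))
            start = end
        return out

    ref, hyp = labels(ref_bounds), labels(hyp_bounds)
    n = len(ref)
    if k is None:
        k = max(1, round(n / (2 * len(ref_bounds))))
    total = n - k
    if total <= 0:
        raise ValueError("text is too short for the chosen window size")
    errors = 0
    for i in range(total):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / total


print(pk([10, 20], [10, 20]))  # identical segmentations give 0.0
```

Lower values are better: 0.0 means the hypothesis agrees with the reference on every sampled pair of positions.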

11 Algorithm – Evaluation Assessment of the algorithm on real texts is still needed. Advantages over the HMM approach: no training is required (which implies domain independence), and probabilistic information can be incorporated into the model. The method might be extendable to detect word descriptions in text.
