A Statistical Model for Domain-Independent Text Segmentation. Masao Utiyama and Hitoshi Isahara. Presentation by Matthew Waymost.

Introduction
The algorithm finds the maximum-probability segmentation of a text using a statistical method.
No training is required.
Domain-independent.

Other Methods
Lexical cohesion
Statistical
–Hidden Markov model (Yamron et al., 1998)

Statistical Model
Find the probability of a segmentation S given a text W.
Use Bayes' rule to find the maximum-probability segmentation.
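
Written out, the Bayes' rule step looks like this. The symbol \hat{S} for the chosen segmentation is notation introduced here; the key point is that Pr(W) does not depend on S, so it drops out of the maximization.

```latex
\hat{S} = \arg\max_{S} \Pr(S \mid W)
        = \arg\max_{S} \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W)}
        = \arg\max_{S} \Pr(W \mid S)\,\Pr(S)
```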

Definition of Pr( W | S )
Assume statistical independence of topics and of words within the scope of a topic.
Assume different topics have different word distributions.
Pr( W | S ) can then be broken down into a double product of word probabilities, taken over the segments and over the words within each segment.
Uses the Laplace estimator (add-one smoothing) to estimate each word's probability from its frequency within the segment.
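
A minimal sketch of this likelihood, assuming the Laplace estimate for a word w in segment i is (count of w in the segment + 1) / (segment length + k), with k the number of distinct words in the whole text. The function names and the choice of k are illustrative assumptions, not taken from the slides. Logs are used so the double product becomes a double sum.

```python
import math
from collections import Counter


def segment_log_prob(segment, vocab_size):
    """log Pr(W_i | S_i): sum of Laplace-smoothed log word probabilities."""
    counts = Counter(segment)
    n_i = len(segment)
    # Inner product over the words of one segment, computed in log space.
    return sum(math.log((counts[w] + 1) / (n_i + vocab_size)) for w in segment)


def text_log_prob(segments):
    """log Pr(W | S): outer product over segments, computed in log space."""
    # Assumption: k is the number of distinct words in the whole text.
    vocab_size = len({w for seg in segments for w in seg})
    return sum(segment_log_prob(seg, vocab_size) for seg in segments)


split = [["a", "a"], ["b", "b"]]
whole = [["a", "a", "b", "b"]]
print(text_log_prob(split), text_log_prob(whole))
```

In this toy case the split segmentation gets the higher likelihood, because each segment's word distribution is more concentrated than that of the merged text.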

Definition of Pr( S )
Varies depending on prior information.
In general, assume no prior information.
The prior keeps the algorithm from generating too many segments; it counteracts Pr( W | S ), which tends to favor finer segmentations.
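
The slide gives no formula for the prior. As an illustration only, assume a prior that decays with the number of segments m, where n is the number of words in W (both symbols are introduced here):

```latex
\Pr(S) = n^{-m}
\qquad\Longrightarrow\qquad
-\log \Pr(S) = m \log n
```

Under this assumption every additional segment adds a fixed penalty of log n to the negative log-probability, which is what holds back over-segmentation.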

Algorithm
Convert the probability function into a cost function by taking the negative log.
Given a text W of n words, define g_i to be the gap between words w_i and w_{i+1}; g_0 precedes the first word and g_n follows the last.
Create a directed graph whose nodes are the gaps and whose edges each cover the segment between the two gaps they connect.
Calculate all edge weights using the cost function and find the minimum-cost path from the first node to the last.

Algorithm
The calculated path represents the minimum-cost segmentation: each edge on the path corresponds to one segment.
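
The graph search can be sketched as a dynamic program over the gaps, since every edge points forward. This is an illustrative sketch, not the authors' implementation: the edge weight assumes a Laplace-smoothed likelihood cost plus a per-segment penalty of log n for the prior, and the names `segment_cost` and `segment_text` are hypothetical. Costs are recomputed for every edge, so it is only meant for short inputs.

```python
import math
from collections import Counter


def segment_cost(words, vocab_size, total_words):
    """Edge weight: negative log-probability of one candidate segment."""
    counts = Counter(words)
    n_i = len(words)
    # Likelihood part: -log of the Laplace-smoothed word probabilities.
    likelihood_cost = sum(
        c * math.log((n_i + vocab_size) / (c + 1)) for c in counts.values()
    )
    # Prior part (assumed form): each segment costs log(total_words).
    return likelihood_cost + math.log(total_words)


def segment_text(words):
    """Return the minimum-cost segmentation as a list of word lists."""
    n = len(words)
    k = len(set(words))
    # best[j]: minimum cost of any path from gap 0 to gap j.
    best = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + segment_cost(words[start:end], k, n)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the last gap; each edge is one segment.
    bounds = []
    node = n
    while node > 0:
        bounds.append((back[node], node))
        node = back[node]
    bounds.reverse()
    return [words[a:b] for a, b in bounds]


print(segment_text("a a a a b b b b".split()))
```

On the toy input the single boundary falls between the run of "a" and the run of "b": splitting there lowers the likelihood cost by more than the extra log n penalty, while any further split does not.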

Algorithm – Features
Determines the number of segments automatically, but the number can also be fixed by specifying the number of edges in the shortest path.
Can restrict where segment boundaries occur by using only the subset of edges whose two end nodes both meet user-specified conditions.
Algorithm is insensitive to text length.
–Good for summarization
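
Both constraints fit naturally into the shortest-path formulation. The sketch below is a generic illustration with a hypothetical function name and a toy edge cost, not the cost function from the paper: it tracks the number of edges used so the segment count can be fixed, and it skips nodes that are not in a user-supplied set of permitted boundaries.

```python
import math


def constrained_min_cost_path(num_nodes, edge_cost, allowed=None, num_edges=None):
    """Min-cost forward path from node 0 to node num_nodes - 1.

    allowed:   optional set of interior nodes usable as boundaries.
    num_edges: if given, the path must use exactly this many edges.
    Returns (path, cost), or None if no path satisfies the constraints.
    """
    last = num_nodes - 1
    nodes = [i for i in range(num_nodes)
             if allowed is None or i in allowed or i in (0, last)]
    max_edges = num_edges if num_edges is not None else len(nodes) - 1
    # best[m][j]: minimum cost of reaching node j using exactly m edges.
    best = [{0: 0.0}] + [dict() for _ in range(max_edges)]
    back = [dict() for _ in range(max_edges + 1)]
    for m in range(1, max_edges + 1):
        for j in nodes:
            for i, cost_i in best[m - 1].items():
                if i < j:
                    c = cost_i + edge_cost(i, j)
                    if c < best[m].get(j, math.inf):
                        best[m][j] = c
                        back[m][j] = i
    candidates = [num_edges] if num_edges is not None else range(1, max_edges + 1)
    m_best = min((m for m in candidates if last in best[m]),
                 key=lambda m: best[m][last], default=None)
    if m_best is None:
        return None
    path, node = [last], last
    for m in range(m_best, 0, -1):
        node = back[m][node]
        path.append(node)
    return path[::-1], best[m_best][last]


# Toy cost for demonstration only: squared segment length plus 1 per segment.
toy_cost = lambda i, j: (j - i) ** 2 + 1
print(constrained_min_cost_path(7, toy_cost))               # unconstrained
print(constrained_min_cost_path(7, toy_cost, num_edges=2))  # exactly 2 segments
print(constrained_min_cost_path(7, toy_cost, allowed={2, 4}))
```

In a text-segmentation setting, `allowed` would typically hold the gaps that coincide with sentence or paragraph ends, so that no segment boundary falls mid-sentence.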

Algorithm – Evaluation
Compared the algorithm against C99 (Choi, 2000).
Used an artificial test corpus extracted from the Brown corpus.
Used a probabilistic error metric to evaluate performance.
Results of the Utiyama algorithm were significantly better than those of the Choi algorithm at the 1% level.
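
The slide does not name the metric. Assuming it is the P_k measure of Beeferman et al. (1999), which is widely used for this task, a minimal sketch follows. Segmentations are given as lists of segment lengths; the function name and the default window (half the mean reference segment length) follow the usual convention and are assumptions here.

```python
def pk(reference, hypothesis, k=None):
    """P_k error: fraction of position pairs (i, i + k) on which the two
    segmentations disagree about 'same segment' vs. 'different segment'."""

    def labels(lengths):
        # Expand segment lengths into a per-position segment id.
        out = []
        for seg_id, length in enumerate(lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")
    n = len(ref)
    if k is None:
        # Conventional window: half the mean reference segment length.
        k = max(1, round(n / len(reference) / 2))
    if n <= k:
        raise ValueError("text is too short for the chosen window")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(n - k)
    )
    return errors / (n - k)


print(pk([4, 4], [4, 4]))  # identical segmentations
print(pk([4, 4], [8]))     # hypothesis misses the only boundary
```

Lower is better: identical segmentations score 0, and each missed or spurious boundary is penalized in proportion to how many windows straddle it.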

Algorithm – Evaluation
Assessment of the algorithm on real texts is still needed.
Advantages over HMM
–No training required (implies domain-independence).
–Can incorporate probabilistic information into the model.
Might be extendable to detect word descriptions in text.
