Survey of segmentation methods. Haei-Ming Chu, 2004/08/17.


1 Survey of segmentation methods. Haei-Ming Chu, 2004/08/17

2 Reference. Xiang Ji and Hongyuan Zha, "Domain-independent text segmentation using anisotropic diffusion and dynamic programming," Proceedings of ACM SIGIR, 2003. David M. Blei and Pedro J. Moreno, "Topic segmentation with an aspect hidden Markov model," Proceedings of ACM SIGIR, 2001.

3 Reference. Evgeny Matusov, Jochen Peters, Carsten Meyer, and Hermann Ney, "Topic segmentation using Markov models on section level," Proceedings of IEEE ASRU, 2003. 方國安, "Story segmentation and classification of Chinese broadcast news using genetic algorithm," NCKU, 2004.

4 Introduction. Dynamic programming based; HMM based; Genetic Algorithm based

5 Dynamic programming based. Xiang Ji and Hongyuan Zha, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA. "Domain-independent text segmentation using anisotropic diffusion and dynamic programming," Proceedings of ACM SIGIR, 2003.

6 Domain-independent text segmentation (1/4). Document-dependent stop words: useful in discriminating among several different documents, but rather harmful in detecting subtopics within a single document. Example: "heart" in articles about heart disease.

7 Domain-independent text segmentation (2/4). Sentence distance matrix

8 Domain-independent text segmentation (3/4). Algorithm for selecting document-dependent stop words

9 Domain-independent text segmentation (4/4)

10 Anisotropic diffusion and dynamic programming (1/6). Anisotropic diffusion, proposed by Perona and Malik. Motivation: people sometimes fail to describe a theme with the correct vocabulary, or cannot arrange lexical information into a coherent semantic unit. Goal: reduce noise in homogeneous regions of an image, making homogeneous regions even more homogeneous, and sharpen the boundaries between them.

11 Anisotropic diffusion and dynamic programming (2/6). Diffusion coefficient function: the evolution is governed by the following equations
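
The diffusion step described above can be sketched in a few lines. This is a minimal Perona-Malik-style sketch applied to a sentence dissimilarity matrix; the exponential coefficient function, the constants kappa and lam, and the wrap-around edge handling via np.roll are illustrative assumptions, not the exact formulation from the paper:

```python
import numpy as np

def anisotropic_diffusion(S, n_iter=20, kappa=0.2, lam=0.25):
    """Perona-Malik-style diffusion on a sentence distance matrix S.
    Smooths noise inside homogeneous (same-topic) regions while
    preserving sharp boundaries between topics."""
    S = S.astype(float).copy()
    # coefficient g(|grad|) is near 1 for small gradients (noise gets
    # smoothed) and near 0 for large gradients (edges are preserved)
    g = lambda d: np.exp(-(d / kappa) ** 2)
    for _ in range(n_iter):
        # finite differences toward the four neighbours
        # (np.roll wraps at the borders; acceptable for a sketch)
        dN = np.roll(S, -1, axis=0) - S
        dS_ = np.roll(S, 1, axis=0) - S
        dE = np.roll(S, -1, axis=1) - S
        dW = np.roll(S, 1, axis=1) - S
        S += lam * (g(dN) * dN + g(dS_) * dS_ + g(dE) * dE + g(dW) * dW)
    return S
```

On a noisy block-structured matrix this reduces the variance inside each block while keeping the contrast between blocks, which is exactly what makes the later boundary search easier.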

12 Anisotropic diffusion and dynamic programming (3/6)

13 Anisotropic diffusion and dynamic programming (4/6). Segmentation by dynamic programming

14 Anisotropic diffusion and dynamic programming (5/6)

15 Anisotropic diffusion and dynamic programming (6/6)
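
The dynamic-programming segmentation named on the preceding slides can be sketched as follows. This is a generic sketch that minimizes total within-segment pairwise dissimilarity for a fixed number of segments K; that cost function is an assumption, not necessarily the paper's exact objective:

```python
import numpy as np

def dp_segment(D, K):
    """Split sentences 0..n-1 into K contiguous segments, minimizing the
    sum over segments of the pairwise dissimilarities D inside each one.
    Returns the K-1 interior boundary indices (start of each new segment)."""
    n = D.shape[0]
    # cost[i][j]: within-segment cost of the segment covering sentences i..j-1
    cost = np.zeros((n + 1, n + 1))
    for i in range(n):
        for j in range(i + 1, n + 1):
            cost[i][j] = D[i:j, i:j].sum() / 2.0   # each pair counted once
    INF = float("inf")
    best = np.full((K + 1, n + 1), INF)            # best[k][j]: k segments for words 0..j-1
    back = np.zeros((K + 1, n + 1), dtype=int)
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):              # last segment is i..j-1
                c = best[k - 1][i] + cost[i][j]
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i
    # trace back the chosen segment starts
    bounds, j = [], n
    for k in range(K, 0, -1):
        bounds.append(j)
        j = back[k][j]
    return sorted(bounds)[:-1]                     # drop the trailing n
```

For example, on a 6x6 dissimilarity matrix with two clean blocks (sentences 0-2 vs. 3-5), `dp_segment(D, 2)` places the single boundary at sentence 3.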

16 Experiment

17 Experiment

18 HMM based (1). David M. Blei, University of California, Berkeley, Dept. of Computer Science, 495 Soda Hall, Berkeley, CA 94720, USA. Pedro J. Moreno, Compaq Computer Corporation, Cambridge Research Laboratory, One Cambridge Center, Cambridge, MA 02142, USA. "Topic segmentation with an aspect hidden Markov model," Proceedings of ACM SIGIR, 2001. 陳舜全, "Initial Studies on Chinese Spoken Document Analysis: Topic Segmentation, Title Generation and Topic Organization," NTU, 2004.

19 Aspect HMM Segmentation (1/9). Independence assumption: the occurrence of a document and a word are independent of each other given a topic (hidden factor). P(w|z) is the language model conditioned on the hidden factor; P(d|z) is the probability distribution over the training segment labels; P(z) is the prior distribution over the hidden factors.

20 Aspect HMM Segmentation (2/9). Use the Expectation-Maximization (EM) algorithm to fit the parameters from an uncategorized corpus. E-step: compute the posterior probability of the hidden variable given the current model.

21 Aspect HMM Segmentation (3/9). M-step: maximize the log-likelihood of the training data with respect to the parameters.
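
The E- and M-steps above can be sketched as a minimal PLSA-style EM loop. The dense (z, d, w) posterior array, the random initialization, and the fixed iteration count are simplifying assumptions for illustration:

```python
import numpy as np

def plsa_em(counts, n_z, n_iter=50, seed=0):
    """EM for the aspect model. counts[d, w] = n(d, w).
    Returns P(z), P(d|z), P(w|z)."""
    rng = np.random.default_rng(seed)
    n_d, n_w = counts.shape
    p_z = np.full(n_z, 1.0 / n_z)
    p_d_z = rng.random((n_z, n_d)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_z, n_w)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) ~ P(z) P(d|z) P(w|z), shape (z, d, w)
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate each factor from the expected counts
        weighted = counts[None, :, :] * post
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True) + 1e-12
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z
```

Each M-step update is just a normalized sum of counts weighted by the E-step posterior, which is what maximizing the expected log-likelihood reduces to for this model.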

22 Aspect HMM Segmentation (4/9)

23 Aspect HMM Segmentation (5/9)

24 Aspect HMM Segmentation (6/9)

25 Aspect HMM Segmentation (7/9)

26 Experiment (1/4). Corpus: SPEECHBOT transcripts from All Things Considered (ATC), a daily news program on National Public Radio; 317 shows from August 1998 through December 1999; 4919 segments with a vocabulary of 35,777 unique terms; about 4 million words; the word error rate is in the 20% to 50% range. 3830 articles of error-free text from the New York Times (NYT); about 4 million words with a vocabulary of 70,792 unique terms. The aspect model was trained with 20 hidden factors.

27 Experiment (2/4). Evaluation: co-occurrence agreement probability (CoAP). D(i, j) is a probability distribution over the distances between words in a document; here D(i, j) = 1 if the two words are k words apart and 0 otherwise. The delta functions are 1 if the two words fall in the same segment and 0 otherwise. k is half the average length of a segment in the training corpus: 170 in the ATC corpus and 200 in the NYT corpus.
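
With this choice of D(i, j), the CoAP can be sketched as follows: for every pair of words exactly k apart, check whether the reference and the hypothesis agree on same-segment vs. different-segment. The boundary representation (word indices where a new segment starts) is an assumption of this sketch:

```python
def coap(ref_bounds, hyp_bounds, n_words, k):
    """Co-occurrence agreement probability with D(i, j) = 1 iff |i - j| = k.
    ref_bounds / hyp_bounds: word indices where a new segment begins."""
    def seg_id(bounds, i):
        # the segment of word i = number of boundaries at or before it
        return sum(1 for b in bounds if b <= i)
    agree, total = 0, n_words - k
    for i in range(total):
        same_ref = seg_id(ref_bounds, i) == seg_id(ref_bounds, i + k)
        same_hyp = seg_id(hyp_bounds, i) == seg_id(hyp_bounds, i + k)
        agree += (same_ref == same_hyp)
    return agree / total
```

A hypothesis identical to the reference scores 1.0; a degenerate hypothesis with no boundaries is penalized on exactly those word pairs that straddle a true boundary.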

28 Experiment (3/4)

29 Experiment (4/4)

30 HMM based (2). Evgeny Matusov (1, 2), Jochen Peters (1), Carsten Meyer (1), and Hermann Ney (2). (1) Philips Research Laboratories, Weisshausstr. 2, D-52066 Aachen, Germany. (2) Lehrstuhl für Informatik VI, RWTH Aachen University of Technology, D-52066 Aachen, Germany. "Topic segmentation using Markov models on section level," Proceedings of IEEE ASRU, 2003.

31 Introduction. Focus on documents that follow a typical structure in the sequence and organization of their individual sections, such as medical reports, scientific articles, meeting protocols, and legal documents. Assume significant correlations in the order of the observed topic sequences, in the lengths of sections belonging to the same topic, and in typical section start or end phrases. Explicitly exploit these typical document structures as an additional knowledge source by applying a generative approach using Markov models on the level of complete sections.

32 Markov models on section level. Length modeling; characteristic phrases near the section boundaries. The sections are now specified not only by their topic but also by their size and location in the document.

33 Theory (1/4). Find an optimal segmentation of the given word stream into K sections (K is to be optimized), which are labeled by their topics and characterized by their section end positions (word indices).

34 Theory (2/4). We are interested in the most probable segmentation given the word stream; apply Bayes' rule and drop the constant prior probability of the words.

35 Theory (3/4). The first term Pr(t_1^K, n_1^K, K) is decomposed into Pr(t_1^K, K) * Pr(n_1^K | t_1^K). Instead of explicitly modeling the distribution Pr(K), we append a fictitious document-end topic t_{K+1} = t_end; Pr(t_1^K, K) = Pr(t_1^{K+1}) is then approximated by a product of topic transition probabilities p(t_k | t_{k-1}). Pr(n_1^K | t_1^K) over the section end positions is decomposed into a product of topic-dependent probabilities p(Δn_k | t_k) of the section lengths Δn_k := n_k - n_{k-1}.

36 Theory (4/4). With these modeling assumptions we obtain the following optimization criterion. The problem can be solved with dynamic programming, performing a two-dimensional simultaneous optimization over the section boundaries and over the topics.
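
The two-dimensional optimization over boundaries and topics can be sketched as a Viterbi-style dynamic program. The interfaces below (per-word log-probability tables, a "start" state, and the assumption that adjacent sections carry different topics) are illustrative conventions, not the paper's exact formulation:

```python
def segment_viterbi(word_topic_logp, trans_logp, len_logp, topics):
    """Jointly choose section end positions and topic labels.
    word_topic_logp[t][i] : log p(word_i | topic t)
    trans_logp[t_prev][t] : log p(t | t_prev), with initial state "start"
    len_logp[t](L)        : log p(section length L | topic t)
    Returns (section end positions, topic sequence)."""
    n = len(next(iter(word_topic_logp.values())))
    # best[(j, t)]: best score of segmenting words 0..j-1, last section topic t
    best = {(0, "start"): 0.0}
    back = {}
    for j in range(1, n + 1):
        for t in topics:
            best_s, best_prev = float("-inf"), None
            for (i, t_prev), s in list(best.items()):
                if i >= j or t_prev == t:   # adjacent sections: different topics
                    continue
                emit = sum(word_topic_logp[t][i:j])
                cand = s + trans_logp[t_prev][t] + len_logp[t](j - i) + emit
                if cand > best_s:
                    best_s, best_prev = cand, (i, t_prev)
            if best_prev is not None:
                best[(j, t)] = best_s
                back[(j, t)] = best_prev
    # pick the best state covering all n words and trace back
    end = max(((j, t) for (j, t) in best if j == n), key=lambda k: best[k])
    bounds, tops, cur = [], [], end
    while cur in back:
        bounds.append(cur[0]); tops.append(cur[1]); cur = back[cur]
    return bounds[::-1], tops[::-1]
```

Each candidate section contributes its topic transition, its length probability, and its word emissions, mirroring the three factors of the optimization criterion above.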

37 Experiment

38 Genetic Algorithm based. 方國安, "Story segmentation and classification of Chinese broadcast news using genetic algorithm," NCKU, 2004.

39 Introduction. Genetic Algorithm: chromosome definition, population initialization, fitness function, genetic operators, parameters.

40 Chromosome definition

41 Population initialization

42 Fitness function

43 Fitness function

44 Genetic operators. Crossover: one-point crossover. Mutation: random assignment.

45 Parameters. Population size: 100. Number of generations: 500. Crossover probability: 0.25. Mutation probability: 0.01.
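
With the operators and parameters listed above, the GA loop can be sketched as follows. The chromosome is taken to be a bit string marking candidate story boundaries, and the elitist truncation selection is an assumption of this sketch; the fitness function is supplied by the caller:

```python
import random

random.seed(0)

# parameters from the slides
POP_SIZE, N_GEN, P_CROSS, P_MUT = 100, 500, 0.25, 0.01

def one_point_crossover(a, b):
    """Mate two chromosomes at a single random cut point."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom):
    """Random-assign mutation: each gene is redrawn with probability P_MUT."""
    return [random.randint(0, 1) if random.random() < P_MUT else g
            for g in chrom]

def evolve(fitness, chrom_len):
    pop = [[random.randint(0, 1) for _ in range(chrom_len)]
           for _ in range(POP_SIZE)]
    for _ in range(N_GEN):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]                                # keep the two best
        while len(next_pop) < POP_SIZE:
            a, b = random.sample(pop[:POP_SIZE // 2], 2)  # pick from top half
            if random.random() < P_CROSS:
                a, b = one_point_crossover(a, b)
            next_pop += [mutate(a), mutate(b)]
        pop = next_pop[:POP_SIZE]
    return max(pop, key=fitness)
```

Because the two best chromosomes survive each generation unmutated, the best fitness in the population never decreases over the 500 generations.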

46 Experiment (1/2)

47 Experiment (2/2)

48 Thanks for all

