Automatic Summarization of News using WordNet Concept Graphs


0 Automatic Summarization of News using WordNet Concept Graphs
Laura Plaza, Alberto Díaz, Pablo Gervás Departamento de Ingeniería del Software e Inteligencia Artificial Universidad Complutense de Madrid

Slide 1: Introduction

Digital newspapers have increased explosively. To tackle this information overload, text summarization can play a role, and there are more and more linguistic and ontological resources that can benefit NLP systems. We present an ontology-based method for automatic summarization of news items, based on mapping the text to concepts in WordNet and representing the document and its sentences as concept graphs.

The motivation of this work is simple and straightforward. In recent years the community of electronic newspaper readers has grown considerably, but most users still read only the headlines and lead sections for lack of time. In the next slides we will see how to automatically build summaries of news items that let users get a proper idea of what an article is about in just a few lines. The proposed method takes advantage of linguistic resources and ontologies; in particular, it maps the text to concepts in the WordNet lexical database and represents the document and its sentences as concept graphs.

IADIS Informatics 2009 Automatic Summarization of News using WordNet Concept Graphs

Slide 2: Related Work

Text summarization with extractive methods: the segments of text (usually sentences) that contain the most relevant information are extracted. Classic methods compute the score of a sentence as a linear combination of the weights resulting from a set of different features: positional, linguistic, statistical. Graph-based methods generally represent the sentences by their tf*idf vectors and compute sentence connectivity using the cosine similarity.

Let's begin by looking at previous work in the area. We can speak of two generic groups of summarization methods: those which generate extracts (extractive methods) and those which generate abstracts (abstractive methods). We focus on the first group. Extractive methods construct summaries by selecting those segments of the text (usually sentences) that contain the most relevant information. Early methods (dating from the 50s-60s) compute the score of a sentence using simple heuristic features, such as the position of the sentence in the document, the frequency of the words it contains, and the presence of certain cue words or indicative phrases. Recently, graph-based methods have become popular; most approaches represent the document as a graph whose nodes are sentences (represented by their term frequency vectors) and whose edges are labeled with the cosine similarity between sentences.
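The graph representation just described can be sketched in a few lines of Python. Plain term frequencies stand in for tf*idf, and the edge threshold is an illustrative choice, not a value from the paper:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sentence_graph(sentences, threshold=0.1):
    # Nodes are sentence indices; an edge connects two sentences whose
    # term-vector cosine similarity exceeds the threshold.
    vectors = [Counter(s.lower().split()) for s in sentences]
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) > threshold
    }
```

Note how two sentences with no shared terms get no edge at all, which is precisely the weakness discussed on the next slide.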

Slide 3: Related Work

Most approaches do not capture the semantic relations between terms (synonymy, hypernymy, homonymy, co-occurrence, ...).

1. "Hurricanes are useful to the climate machine. Their primary role is to transport heat from the lower to the upper atmosphere," he said.
2. He explained that cyclones are part of the atmospheric circulation mechanism, as they move heat from the inferior to the superior atmosphere.

These term-based methods exhibit one main problem: they do not capture the semantics of the terms or the relations between them. Let's see an example that illustrates the situation. Sentences 1 and 2 have exactly the same meaning, but because they use different vocabulary, term-based methods do not recognize this equivalence. Concept graph-based approaches, such as the one presented here, attempt to solve this problem.

Slide 4: Related Work

Graph-based document clustering for extractive multi-document summarization (CSUGAR, Yoo et al.): the entire collection of documents is represented as an extended graph of MeSH concepts. This graph is clustered using an algorithm based on Scale-Free Network Theory that generates, for each cluster, a graph and a set of vertices as centroids (the Hub Vertices Set, HVS). Each document is assigned to a cluster using a vote mechanism based on whether the vertices in the document belong to the HVS or not. All documents assigned to the same cluster contribute to a single summary.

Our method is inspired by the work of Yoo et al. on clustering biomedical scientific documents, which they later apply to multi-document summarization. In a few words: first, they represent the entire collection of documents as an extended graph of concepts from the MeSH medical terminology. Next, this graph is clustered using an algorithm based on Scale-Free Network Theory, generating a number of clusters, each of which is represented by a graph and a set of vertices as centroids (what they call the Hub Vertices Set). In this algorithm, the initial number of hub vertices is an important parameter that must be set empirically. Finally, each document is assigned to a cluster, using a vote procedure to measure the similarity between the document and the cluster representations. Once the documents have been assigned to clusters, all documents in the same cluster are supposed to deal with the same topic, and are therefore used to produce a single summary.

Slide 5: WordNet

An electronic lexical database developed at Princeton University (Miller et al., 1993). Words with the same meaning are grouped in a synset; each synset has a gloss that defines it. Synsets are connected via semantic relations:
Hypernymy and hyponymy: feline is a hypernym of cat; cat is a hyponym of feline.
Holonymy and meronymy: vehicle is a holonym of wheel; wheel is a meronym of vehicle.
Troponymy: to lisp is a troponym of to talk.
Entailment: to sleep is entailed by to snore.

As I have already mentioned, our method relies heavily on WordNet, so I will give a short introduction to it. WordNet is an electronic lexical database developed at Princeton University. WordNet structures lexical information in terms of word meanings, so that words with the same meaning are grouped into a single synonym set, called a synset. Each synset has a unique identifier and a gloss that defines it. Synsets are connected to one another via a number of semantic relations. These relations depend on the type of word and include, among others, hypernymy and hyponymy for nouns, or troponymy and entailment for verbs.
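These relation types can be pictured with a toy lookup table. The entries below are hand-built stand-ins mirroring the slide's examples, not the real WordNet database:

```python
# A toy fragment of WordNet-style relations (NOT the real database):
# each entry maps (relation, source word) to a target word.
RELATIONS = {
    ("hypernym", "cat"): "feline",       # feline is a hypernym of cat
    ("holonym", "wheel"): "vehicle",     # vehicle is a holonym of wheel
    ("troponym", "talk"): "lisp",        # to lisp is a troponym of to talk
    ("entailment", "snore"): "sleep",    # to snore entails to sleep
}

# Hypernym/hyponym and holonym/meronym are inverse pairs.
INVERSE = {"hypernym": "hyponym", "holonym": "meronym"}

def related(relation, word):
    # Look up a relation in the toy table, following inverses where defined.
    if (relation, word) in RELATIONS:
        return RELATIONS[(relation, word)]
    inv = {v: k for k, v in INVERSE.items()}.get(relation)
    if inv:  # e.g. hyponym(feline) -> cat, via hypernym(cat) -> feline
        for (rel, src), dst in RELATIONS.items():
            if rel == inv and dst == word:
                return src
    return None
```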

Slide 6: Summarization Method

An extractive mono-document method that selects sentences based on their similarity with concept clusters. Steps: preprocessing; graph-based document representation; concept clustering and theme recognition; sentence selection. A worked document example from the DUC collection will illustrate the method.

Now let's start with the main section, the discussion of the summarization method. The algorithm involves the following steps: preprocessing, document representation, concept clustering and theme recognition, and sentence selection. I will illustrate the method with a simple example, using a news article from the Document Understanding Conferences 2002.

Slide 7: Preprocessing

Extracting the body of the news item. Splitting the text into sentences using GATE. Generic words and frequent terms are removed, since they are not useful in discriminating between relevant and non-relevant sentences.

The first step is preprocessing. This preliminary step prepares the document for the following phases. First, the body of the news item is extracted. Second, the text is split into sentences, using the sentence splitter module of the GATE architecture. Third, generic words and very frequent terms are removed, as they are not useful in discriminating between relevant and non-relevant sentences.

Slide 8: Graph-based Document Representation

Representing the news item body as a graph, where the vertices are the WordNet concepts associated to the terms and the edges indicate the relations between them. Steps: mapping between terms and concepts; graph-based sentence representation; graph-based document representation.

The second step is document representation. Its purpose is to represent the news body as a graph, where vertices are the WordNet concepts associated to the terms in the text and edges represent semantic relations between them. The process can be seen as three different subprocesses.

Slide 9: Graph-based Document Representation: Mapping between terms and concepts (word sense disambiguation)

"Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas."

WordNet SenseRelate (word sense disambiguation) assigns a WordNet sense to each concept, e.g. hurricane (sense 1), defense (sense 9), prepare (sense 4), Gilbert (sense 2); other concepts extracted from the sentence include alert, high, sweep, heavily, wind, Dominican Republic, populate, heavy, Sunday, south, rain, civil, coast and sea.

First, the terms in each sentence are mapped to the appropriate concepts in WordNet. This apparently trivial mapping entails one main difficulty: since a term can present different senses in WordNet, some kind of disambiguation must be done to determine the correct sense of each term. To solve this problem, we use the WordNet SenseRelate package, which includes several algorithms for word sense disambiguation. In particular, we use the Lesk measure, which scores the similarity between a word and its neighbors by the amount of overlap (in words) between the definitions of the two terms. The slide shows how an example sentence has been translated into concepts, each assigned the appropriate WordNet sense. Consider the concept defense, which has been assigned WordNet sense number 9. Many works that perform a similar mapping of terms to WordNet concepts do no disambiguation and simply take the first sense; in this case, the first sense of defense refers to the role of some players in a sports team, which has nothing to do with its real sense in this sentence.
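A simplified Lesk-style disambiguation can be sketched as follows. The actual system uses the WordNet SenseRelate Perl package; the glosses below are abbreviated, hand-written stand-ins for WordNet's:

```python
def gloss_overlap(gloss_a, gloss_b):
    # Lesk-style score: number of distinct words shared by two glosses.
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

def disambiguate(context, sense_glosses):
    # Pick the sense number whose gloss overlaps most with the context.
    return max(sense_glosses, key=lambda s: gloss_overlap(sense_glosses[s], context))

# Hand-abbreviated glosses for 'defense' (illustrative, not WordNet's actual text):
SENSES = {
    1: "the players trying to prevent the other team from scoring",
    9: "a civil organization for the protection of the population",
}
context = "the civil defense alerted the population of the south coast"
```

With these glosses, sense 9 wins because its gloss shares "civil", "population", "the" and "of" with the context, while the sports sense shares almost nothing.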

Slide 10: Graph-based Document Representation: Graph-based Sentence Representation

WordNet concepts derived from common nouns are extended with their hypernyms.

Starting from the concepts identified in the previous step, those corresponding to common nouns are extended with their hypernyms in WordNet, building a hierarchical representation for each sentence. The figure shows the graph for the example sentence. In this graph, edges are temporarily unlabeled and represent "is_a" relations. Only a single vertex is created for each distinct concept: if two different terms stand for the same concept, a single vertex represents both terms.
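The hypernym extension can be sketched with a toy hypernym map; the real method walks these chains in WordNet, and the map below is illustrative:

```python
# Toy hypernym map (child -> parent); the real method reads these chains from WordNet.
HYPERNYM = {
    "hurricane": "cyclone",
    "cyclone": "windstorm",
    "windstorm": "storm",
    "coast": "shore",
    "shore": "geological formation",
}

def concept_graph(concepts):
    # Extend each concept with its full hypernym chain; edges are is-a relations.
    # Using a set of edges gives a single vertex per distinct concept: two terms
    # standing for the same concept contribute exactly the same vertex and edges.
    edges = set()
    for c in concepts:
        while c in HYPERNYM:
            edges.add((c, HYPERNYM[c]))
            c = HYPERNYM[c]
    return edges
```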

Slide 11: Graph-based Document Representation: Graph-based Document Representation

The sentence graphs are merged into a document graph, enriched with similarity relations between leaf concepts when similarity > threshold. Each edge is assigned a weight directly proportional to the depth of the concepts that it connects (Yoo et al., 2007).

(Figure: a fragment of the merged hierarchy, with concepts such as physical entity, physical object, location, region, territory, territorial division, country, geological formation, shore, coast, thing, body of water and sea, and edge weights 1/2, 2/3, 3/4, 4/5, 5/6, 6/7.)

Finally, all sentence graphs are merged into a single graph that represents the whole news item. This graph can be extended with different, more specific semantic relations between concepts. In particular, we have tested a second relation besides hypernymy, which expresses the semantic relatedness between concepts in the graph. To compute this similarity, we use the WordNet::Similarity package, a Perl module that implements various semantic relatedness and similarity measures. Again, we have opted for the Lesk algorithm, which, as mentioned before, computes the semantic relatedness of word senses using gloss overlaps. First, we calculate the similarity between every pair of leaf concepts in the graph, and we add a new edge between two leaves if their similarity is greater than a threshold. Finally, each edge is labeled with a weight directly proportional to the depth of the concepts it connects; that is, the more specific the concepts, the more weight is assigned to the edge.
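Reading the figure's weights 1/2, 2/3, 3/4, ... as w = d / (d + 1) for an edge whose deeper endpoint sits at depth d (our interpretation of the figure, not a formula stated on the slide), the weighting and the similarity enrichment can be sketched as:

```python
def edge_weight(depth):
    # Deeper (more specific) concepts get weights closer to 1:
    # depth 1 -> 1/2, depth 2 -> 2/3, depth 3 -> 3/4, ...
    return depth / (depth + 1)

def add_similarity_edges(leaves, similarity, threshold=0.2):
    # Connect every pair of leaf concepts whose pairwise similarity
    # (e.g. a Lesk gloss-overlap score, normalized) exceeds the threshold.
    edges = []
    for i, a in enumerate(leaves):
        for b in leaves[i + 1:]:
            if similarity(a, b) > threshold:
                edges.append((a, b))
    return edges
```

The 0.2 default mirrors the threshold that the experiments later select as the best value.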

Slide 12: Concept Clustering and Theme Recognition

The document graph is clustered into concept subgraphs, using a clustering algorithm similar to the one used in CSUGAR (Yoo et al., 2007); a subgraph and an HVS are obtained for each cluster. Each cluster is composed of a set of concepts that are closely related in meaning, and can be seen as a theme in the document.

The third step is concept clustering and theme recognition. Its aim is to cluster the WordNet concepts into sets of concepts closely related in meaning, using a clustering method similar to the one used in CSUGAR. I will not explain in detail how the clustering is carried out; suffice it to say that, as a result, we obtain a variable number of clusters, each of which is composed of a subgraph and a set of concepts named the Hub Vertices Set: a set of vertices strongly related to one another that can be seen as the centroid of the cluster. In turn, each cluster can be seen as a theme in the document.

Slide 13: Sentence Selection

Selecting significant sentences for the summary, based on the similarity between sentences and clusters. The similarity between a cluster Ci and a sentence Oj is measured with a vote mechanism based on whether the vertices of the sentence graph belong to the HVS or not:

similarity(Ci, Oj) = sum over the vertices v of Oj of w(v, Ci), where w(v, Ci) = 1.0 if v belongs to the HVS of Ci, 0.5 if v belongs to Ci but not to its HVS, and 0 otherwise.

The last step is sentence selection. Its goal is to select the significant sentences for the summary, based on the similarity between the sentence graphs and the clusters. It is thus necessary to define a similarity measure between a cluster and a sentence graph. Because the two representations differ considerably in size, traditional metrics like the edit distance are not appropriate. Instead, a vote mechanism adapted from the work of Yoo et al. is used: if a vertex of the sentence graph belongs to a cluster, it gives 1 point if it belongs to the HVS concepts and 0.5 points if it belongs to the non-HVS concepts. This way, each sentence gets a score for each cluster, and each sentence is assigned to the cluster for which its score is highest.
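The vote mechanism can be sketched directly from this description, with sets of concept identifiers standing in for the graphs:

```python
def vote_similarity(cluster, hvs, sentence_concepts):
    # A sentence concept contributes 1.0 if it is a hub vertex of the
    # cluster, 0.5 if it is a non-hub cluster vertex, and 0 otherwise.
    score = 0.0
    for c in sentence_concepts:
        if c in hvs:
            score += 1.0
        elif c in cluster:
            score += 0.5
    return score
```

For example, a sentence mentioning one hub concept and one non-hub cluster concept scores 1.5 against that cluster.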

Slide 14: Sentence Selection

Three heuristics are tested:
Heuristic 1: for each cluster, the top sentences are selected, in a number proportional to its size.
Heuristic 2: all sentences are selected from the cluster with the most concepts.
Heuristic 3: a single score is computed for each sentence, as the sum of its similarities to the clusters adjusted to their sizes, and the sentences with the highest scores are selected.

Finally, three heuristics have been investigated for selecting the sentences for the summary. The first heuristic selects the top sentences of each cluster, in a number proportional to the cluster size. The second heuristic accepts the hypothesis that the cluster with the most concepts represents the main theme of the document, and selects all sentences from this cluster. The third heuristic computes a single score for each sentence, as the sum of the scores assigned to each cluster adjusted to their sizes, and selects the sentences with the highest global scores.
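Heuristic 3 can be sketched as follows. Note that "adjusted to their sizes" is read here as weighting each cluster by its relative size; the exact adjustment is our assumption, not a formula given in the talk:

```python
def sentence_scores(similarities, cluster_sizes):
    # similarities[i][k]: vote similarity of sentence i to cluster k.
    # "Adjusted to their sizes" is read here as weighting each cluster's
    # contribution by its relative size (an assumption, not the paper's formula).
    total = sum(cluster_sizes)
    return [
        sum(sim * size / total for sim, size in zip(row, cluster_sizes))
        for row in similarities
    ]
```

The summary is then built from the sentences with the highest resulting scores.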

Slide 15: Experimental Results

Two separate groups of experiments: parameter setting, and comparison with a positional lead baseline.
Evaluation collection: 10 news items from the DUC 2002 collection; each item comes with (at least) one model abstract written by a human; compression rate of 30%.
Evaluation metric: ROUGE recall measures R-1, R-2, R-L and R-W-1.2.

To test the method, we carried out two separate groups of experiments: the point of the first group is to determine the best values for the different parameters of the algorithm, while the second compares these preliminary results with a lead baseline. For the evaluation, we use 10 news items from the Document Understanding Conferences and compare the summaries produced by our method with the model abstracts written by humans that come with the items. As ROUGE (Lin, 2004) is the most common metric for automatic evaluation of summarization, we have computed some of the ROUGE recall measures. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and includes several measures that automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units, such as n-grams, word sequences and word pairs, between the computer-generated summary to be evaluated and the ideal summaries created by humans.

Slide 16: Experimental Results

Algorithm parametrization:
Parameter 1: Which heuristic produces the best summaries?
Parameter 2: What percentage of vertices should be considered hub vertices (HV) by the clustering method?
Parameter 3: Is it better to consider both types of relations between concepts (hypernymy and similarity) or just the hypernymy relation?
Parameter 4: If the similarity relation is taken into account, what similarity threshold should be used?

So, in order to set the algorithm parameters, we have to study these four questions.

Slide 17: Experimental Results

Parameters 1 & 2: heuristic and hub vertices percentage (the percentage labels for Heuristics 2 and 3 are inferred from the Heuristic 1 rows; the R-2 value for Heuristic 3 at 2% is missing in the source):

Heuristic 1,  1%: R-1 0.69324, R-2 0.33723, R-L 0.65202, R-W-1.2 0.24847
Heuristic 1,  2%: R-1 0.71474, R-2 0.34520, R-L 0.69115, R-W-1.2 0.26225
Heuristic 1,  5%: R-1 0.67814, R-2 0.28308, R-L 0.63836, R-W-1.2 0.23646
Heuristic 1, 10%: R-1 0.67201, R-2 0.27093, R-L 0.62717, R-W-1.2 0.22283
Heuristic 2,  1%: R-1 0.71446, R-2 0.33185, R-L 0.67367, R-W-1.2 0.25384
Heuristic 2,  2%: R-1 0.72487, R-2 0.34438, R-L 0.68040, R-W-1.2 0.25810
Heuristic 2,  5%: R-1 0.70358, R-2 0.30756, R-L 0.65924, R-W-1.2 0.24488
Heuristic 2, 10%: R-1 0.72449, R-2 0.32887, R-L 0.67966, R-W-1.2 0.25547
Heuristic 3,  1%: R-1 0.72056, R-2 0.34105, R-L 0.68058, R-W-1.2 0.25727
Heuristic 3,  2%: R-1 0.72755, R-2 (missing), R-L 0.68308, R-W-1.2 0.25886
Heuristic 3,  5%: R-1 0.70560, R-2 0.31164, R-L 0.66377, R-W-1.2 0.24612
Heuristic 3, 10%: R-1 0.71273, R-2 0.31980, R-L 0.66888, R-W-1.2 0.25162

The first experiments determine the best heuristic along with the percentage of hub vertices for the clustering method. For these experiments, only the hypernymy relation is used. As shown in the table, the third heuristic gives the best results when the number of hub vertices is set to 2%. Nonetheless, the differences between the three heuristics are not significant, so a priori we cannot make a decision on the best heuristic.

Slide 18: Experimental Results

Parameter 3: semantic relations (hub vertices at 2%, similarity threshold provisionally 0.2; the R-2 value for Heuristic 3 with hypernymy only is missing in the source):

Heuristic 1, Hypernymy:   R-1 0.71470, R-2 0.32736, R-L 0.67565, R-W-1.2 0.25250
Heuristic 1, Hyp. & Sim.: R-1 0.72736, R-2 0.34230, R-L 0.68558, R-W-1.2 0.25681
Heuristic 2, Hypernymy:   R-1 0.72487, R-2 0.34438, R-L 0.68040, R-W-1.2 0.25810
Heuristic 2, Hyp. & Sim.: R-1 0.72920, R-2 0.33949, R-L 0.68463, R-W-1.2 0.25664
Heuristic 3, Hypernymy:   R-1 0.72755, R-2 (missing), R-L 0.68308, R-W-1.2 0.25886
Heuristic 3, Hyp. & Sim.: R-1 0.73118, R-2 0.32941, R-L 0.67838, R-W-1.2 0.25323

Parameter 4: similarity threshold (using both relations):

Threshold 0.01: R-1 0.71145, R-2 0.31381, R-L 0.66314, R-W-1.2 0.24872
Threshold 0.05: R-1 0.71470, R-2 0.32736, R-L 0.67565, R-W-1.2 0.25250
Threshold 0.1:  R-1 0.71953, R-2 0.32573, R-L 0.67578, R-W-1.2 0.25477
Threshold 0.2:  R-1 0.73118, R-2 0.32941, R-L 0.67838, R-W-1.2 0.25323
Threshold 0.5:  R-1 0.71058, R-2 0.31786, R-L 0.66690, R-W-1.2 0.24892

For the second experiment, the percentage of hub vertices is set to 2% and the similarity threshold is provisionally set to 0.2. Using both relations together (hypernymy and similarity) improves the summary evaluation. Finally, the third experiment fixes the similarity threshold for the WordNet similarity algorithm; according to the results in the table, the value 0.2 gives the best outcome.

Slide 19: Experimental Results

Best parametrization: heuristic undetermined; hub vertices percentage 2%; relations: hypernymy and similarity; similarity threshold 0.2.

Comparison with a positional lead baseline:
Best configuration: R-1 0.73118, R-2 0.32941, R-L 0.67838, R-W-1.2 0.25323
Lead baseline:      R-1 0.59436, R-2 0.18826, R-L 0.55522, R-W-1.2 0.20488

Finally, with the parametrization established above, we evaluate the method against a lead baseline summary constructed by selecting only the first sentences of each news item. Our method obtains around 73% average ROUGE-1 recall, while the lead baseline obtains around 59%. This is a significant difference; moreover, it must be taken into account that the positional heuristic used in the baseline is itself quite a pertinent heuristic for summarizing news articles, where the most important information is usually concentrated in the first one or two sentences. When dealing with very long documents (such as scientific papers), this heuristic is less appropriate, and the difference with more sophisticated methods becomes more evident.

Slide 20: Conclusions

Representing the text using ontology concepts allows a richer representation than the one provided by a vector space model. The method proposed can be applied to documents from different domains, and was previously tested on biomedical scientific literature. In the preliminary evaluation, there are no significant differences between the heuristics, but the results are clearly better than the baseline.

To sum up, I have introduced a method for summarizing text documents. Although the method is domain-independent, it has been applied here to a specific type of document: news. We represent the document as an ontology-enriched graph, using WordNet concepts and relations. This gives a richer representation than traditional term-based models, which results in a considerable improvement of the evaluation and the quality of the resulting summaries. Another important contribution of the proposed method is that it can be applied to documents from different domains: it has been designed to work in any domain with minor changes, as it only requires changing the ontology and the disambiguation algorithm. The authors have previously tested the method in a completely different domain, the automatic summarization of biomedical scientific literature. Finally, a preliminary evaluation has been carried out; it did not allow us to decide which heuristic is best, but a comparison with a positional baseline shows quite promising results.

Slide 21: Future Work

Long sentences have a higher probability of being selected; a fix is normalizing the sentence scores over the number of concepts. Other lines of work: performing a large-scale evaluation on the DUC 2002 corpus (ROUGE); comparing with similar systems (e.g. LexRank); extending the method to multi-document summarization.

Nonetheless, we have identified several problems and possible improvements. First, since our method extracts whole sentences, long ones have a higher probability of being selected because they contain more concepts; an alternative would be to normalize the sentence scores by the number of concepts. Second, to evaluate the method formally, a large-scale evaluation on the 566 news articles of DUC 2002 is under way. As future work, we plan to compare these results with those reported by similar systems (e.g. LexRank); this will let us determine whether statistically significant differences exist between the algorithm presented in this paper and the most accepted methods in the area. Finally, we are working on an extension of the method to multi-document summarization, which will permit generating a single summary from a set of documents on the same topic.

Slide 22: Thank you! So, thank you very much for your attention.

Slide 23: Appendix: Clustering Algorithm

We compute the salience of all vertices in the document graph. The n vertices with the highest salience (hub vertices) are iteratively grouped in Hub Vertex Sets (centroids). The remaining vertices (non-HVS) are assigned to the cluster to which they are most connected. This process repeats iteratively, progressively adjusting both the HVS and the remaining vertices.

We hypothesize that the document graph is an instance of a scale-free network (Barabási and Albert, 1999). These networks contain a particular type of node that is highly connected to other nodes in the network (hub nodes), while the remaining nodes are rather unconnected. Following (Yoo et al., 2007), we define the salience of a vertex vi as the sum of the weights of the edges that have the given vertex as source or target. The n vertices with the highest salience (hub vertices) are selected and iteratively grouped into Hub Vertex Sets (HVS). The HVS are sets of vertices strongly related to one another, and constitute the centroids of the clusters to be constructed. The remaining vertices (those not included in the HVS) are assigned to the cluster to which they are most connected. This is again an iterative process that progressively adjusts the HVS and the assigned vertices.
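The salience computation can be sketched as follows, with edges stored as an (endpoint pair) → weight map:

```python
def salience(vertex, weighted_edges):
    # Salience of a vertex: sum of the weights of all edges that have the
    # vertex as source or target (edges stored as {(source, target): weight}).
    return sum(w for (u, v), w in weighted_edges.items() if vertex in (u, v))
```

The hub vertices are then simply the n vertices with the highest salience.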

Slide 24: Appendix: ROUGE

ROUGE: Recall-Oriented Understudy for Gisting Evaluation; a comparison between automatic summaries and model or ideal summaries created by humans. Different recall measures:
ROUGE-N (N = 1…4): the number of n-grams co-occurring in a candidate summary and the reference summaries.
ROUGE-L (Longest Common Subsequence): the length of the LCS is used to estimate the similarity between the candidate and the reference summaries.
ROUGE-W-1.2 (Weighted Longest Common Subsequence): improves ROUGE-L by taking into account the presence of consecutive matches.

Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries, computed as:

ROUGE-N = [sum over S in references, sum over gram_n in S of Count_match(gram_n)] / [sum over S in references, sum over gram_n in S of Count(gram_n)]

where n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries. For ROUGE-L, the intuition is that the longer the LCS of two summary sentences is, the more similar the two summaries are. To improve the basic LCS method, we can simply remember the length of consecutive matches encountered so far in the regular two-dimensional dynamic programming table that computes the LCS.
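ROUGE-N recall as defined above can be sketched as:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    # Recall: clipped n-gram matches over the total n-grams in the references.
    cand = ngrams(candidate.lower().split(), n)
    match = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        match += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0
```

For example, the candidate "the cat sat" against the reference "the cat sat on the mat" matches 3 of the 6 reference unigrams, giving ROUGE-1 = 0.5.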

