Effects of overlaying ontologies to TextRank graphs Project Report By Kino Coursey.

1 Effects of overlaying ontologies to TextRank graphs Project Report By Kino Coursey

2 Outline
Introduction & Background
Ontology based Summarization
Evaluation
Discussion
Future Work
Conclusion

3 Motivation
An exponentially increasing volume of information requires summarization
– Humans are finite
– Text is being generated faster than a reader can read
– Need to quickly identify the relevance of documents

4 Central Question: Does knowing more really help?
TextRank and a number of other random-walk NLP algorithms have been applied to areas such as text summarization and keyword extraction.
How would additional information from an ontology like WordNet or Cyc affect such algorithms? Would they perform better or worse?

5 Evaluation Criteria
The evaluation criterion is the change in TextRank's performance when given the extra information.
The evaluation dataset is the Document Understanding Conference 2002 (DUC-2002) summarization test set.
The ROUGE summarization evaluation tool is used to measure the performance change.
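The deck later names ROUGE-S as the ROUGE variant used for scoring. As a rough illustration of what that measure counts, here is a minimal sketch of a skip-bigram F-measure. It is not the ROUGE tool itself: the function names are made up for this example, tokenization is plain whitespace splitting, and there is no stemming, stopword handling, or skip-distance limit.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every in-order word pair (w_i, w_j) with i < j, at any gap."""
    return Counter(combinations(tokens, 2))


def rouge_s(candidate, reference, beta=1.0):
    """Skip-bigram F-measure between two whitespace-tokenized strings."""
    cand = skip_bigrams(candidate.lower().split())
    ref = skip_bigrams(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


score = rouge_s("police kill the gunman", "police killed the gunman")
```

In this toy pair, three of the six skip-bigrams match on each side, so precision and recall are both one half.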

6 Project Plan
Implement TextRank
Construct an algorithm to import data from Cyc into TextRank
Construct evaluation dataset preprocessor
Develop a parameter tuning process
Measure performance with optimal parameters
Analyze and report results

7 Implementation
Implemented Intelligent Surfer model in Perl
Implemented text-to-Cyc graph extraction
– Denotation map
– Using: isa, genls, conceptuallyRelated, mainDomain, definingMt
Explored graph visualization technology (easier to debug what you can see)
– Nodes3d from BrainMaps.org

8 Ontology Based Summarization
Augment TextRank with Cyc relationships
– Perform initial context-free mapping into Cyc terms
– Perform ranking process
– Select the highest ranked sentences as extractive summary
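The last step, picking the highest ranked sentences, can be sketched as below. This assumes the ranking process has already produced one score per sentence; the function name, the toy sentences, and the choice to return sentences in document order are illustrative assumptions, not details given in the slides.

```python
def extractive_summary(sentences, scores, k=2):
    """Return the k highest-scoring sentences, kept in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore original ordering
    return [sentences[i] for i in chosen]


sentences = ["A storm formed.", "It moved north.", "Officials issued warnings."]
scores = [0.41, 0.18, 0.35]  # hypothetical rank values
summary = extractive_summary(sentences, scores, k=2)
```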

9 Intelligent Surfer Model
The Standard Model: for all nodes use [equation image not preserved in transcript]
Intelligent Surfer Model: for all nodes use [equation image not preserved in transcript]
Constraint on S_i: [equation image not preserved in transcript]
S_i is apportioned as a function of query relevancy. Here words in the input text have S_i = 1/N while all other nodes have S_i = 0.
When you get tired you jump back to the “problem statement”, the input.
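The slide's own equations did not survive extraction. For orientation only, a commonly used form of a PageRank-style update with a jump distribution is given below. The symbols R_i (rank of node i), d (damping factor), In(i), and Out(j) are assumptions introduced here, as is the normalization constraint; only S_i comes from the slide.

```latex
R_i \;=\; (1-d)\,S_i \;+\; d \sum_{j \in \mathrm{In}(i)} \frac{R_j}{|\mathrm{Out}(j)|},
\qquad \sum_i S_i = 1
```

Under this reading, a uniform S_i over all nodes gives the standard model, while placing S_i = 1/N on the input-text words and 0 elsewhere sends every random jump back to the input. The normalization holds if N counts the input-word nodes.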

10 Weighted Version
Sum of the outputs --> weighted updates --> summation of the weighted outputs of the currently ranked nodes
[equation images not preserved in transcript]
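A minimal sketch of a weighted update with a jump distribution follows. The project's implementation was in Perl and its exact equations are not preserved, so treat this as one plausible formulation: the function name, the damping value 0.85, the fixed iteration count, and the toy graph are all assumptions. Nodes with no outgoing edges simply pass nothing on.

```python
def weighted_rank(edges, jump, d=0.85, iterations=50):
    """edges: {source: {target: weight}}; jump: {node: S_i}, summing to 1."""
    nodes = set(jump) | set(edges)
    for targets in edges.values():
        nodes |= set(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(edges.get(n, {}).values()) for n in nodes}
    for _ in range(iterations):
        # Jump term: mass returns only to nodes with non-zero S_i.
        new_rank = {n: (1 - d) * jump.get(n, 0.0) for n in nodes}
        # Link term: each node shares its rank in proportion to edge weight.
        for src, targets in edges.items():
            if out_weight[src] == 0:
                continue
            for dst, w in targets.items():
                new_rank[dst] += d * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Toy graph: two input words plus one ontology term (weights are made up).
edges = {
    "hurricane": {"HurricaneAsEvent": 1.0, "gilbert": 1.0},
    "gilbert": {"hurricane": 1.0},
    "HurricaneAsEvent": {"hurricane": 0.5},
}
jump = {"hurricane": 0.5, "gilbert": 0.5}  # S_i = 1/N on input words only
rank = weighted_rank(edges, jump)
```

Because the ontology node has S_i = 0, it can only gain rank through links from the input words.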

11 From text to Cyc graph
Text-to-Cyc graph extraction
– Denotation map
– Using: isa, genls, conceptuallyRelated, mainDomain, definingMt
– Each edge has its own weight associated with it
– Finding the right weight is its own process
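One way to picture per-relation edge weights is sketched below. The weight values, the triple layout, and the terms HurricaneGilbert, WeatherEvent, and someOtherPredicate are hypothetical placeholders; only the five relation names and HurricaneAsEvent come from the slides.

```python
# Hypothetical per-relation weights; the slides tune such values with a GA.
RELATION_WEIGHTS = {
    "isa": 1.0,
    "genls": 0.8,
    "conceptuallyRelated": 0.5,
    "mainDomain": 0.3,
    "definingMt": 0.2,
}


def build_weighted_edges(assertions, weights=RELATION_WEIGHTS):
    """Turn (relation, source, target) triples into {source: {target: weight}}."""
    edges = {}
    for relation, source, target in assertions:
        w = weights.get(relation)
        if w is None:
            continue  # relation is not in the selected set
        targets = edges.setdefault(source, {})
        targets[target] = targets.get(target, 0.0) + w
    return edges


# Illustrative assertions, not taken from the Cyc knowledge base.
assertions = [
    ("isa", "HurricaneGilbert", "HurricaneAsEvent"),
    ("genls", "HurricaneAsEvent", "WeatherEvent"),
    ("someOtherPredicate", "HurricaneGilbert", "Thing"),
]
edges = build_weighted_edges(assertions)
```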

12 Finding the right terms
(denotation-mapper "Hurricane Gilbert swept toward the Dominican Republic Sunday")
Results:
(("Hurricane" . HurricaneAsObject)
 ("Hurricane" . HurricaneAsEvent)
 ("Gilbert" . JohnGilbert)
 ("Gilbert" . JodyGilbert)
 ("Gilbert" . MelissaGilbert)
 ("Gilbert" . GilbertStuart-TheArtist)
 ("Gilbert" . GilbertGottfried)
 ("swept" . SweepingAnArea)
 ("swept" . (ThingDescribableAsFn Sweep-TheWord Adjective))
 ("toward" . (HypothesizedPrepositionSenseFn Toward-TheWord Preposition))
 ("the Dominican Republic" . DominicanRepublic)
 ("Sunday" . wikip-Sunday)
 ("Sunday" . (ThingDescribableAsFn Sunday-TheWord Adjective)))
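Since the mapping is context free, every candidate term for a phrase enters the graph. A small sketch of that step, using a handful of pairs from the output above, might look like this. Splitting each phrase's weight evenly across its candidates is an assumption made for the example, as is the function name.

```python
from collections import defaultdict

# A few (phrase, candidate term) pairs copied from the slide's output.
denotations = [
    ("Hurricane", "HurricaneAsObject"),
    ("Hurricane", "HurricaneAsEvent"),
    ("Gilbert", "JohnGilbert"),
    ("Gilbert", "GilbertGottfried"),
    ("the Dominican Republic", "DominicanRepublic"),
]


def candidate_edges(pairs):
    """Link each phrase to all of its candidate terms, splitting weight evenly."""
    by_phrase = defaultdict(list)
    for phrase, term in pairs:
        by_phrase[phrase].append(term)
    return {
        phrase: {term: 1.0 / len(terms) for term in terms}
        for phrase, terms in by_phrase.items()
    }


edges = candidate_edges(denotations)
```

No disambiguation happens here; the ranking process is left to decide which candidates matter.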

13 The Big View

14 Tuning the system with Genetic Algorithms
A steady-state genetic algorithm was used to search for a good edge weighting, with each candidate scored by ROUGE-S on a subset of documents.

15 Genetic Algorithm & Evaluation Function
1. Select k members for tournament (here k=4).
2. For all members in tournament evaluate performance on the task and compute fitness.
3. Perform tournament selection by sorting based on fitness and creating a parent set and a replacement set.
4. Copy parents over replacement set to make children.
5. Do mutation and crossover operations on children.
6. Go to step 1.
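The six steps above can be sketched as follows. This is a generic steady-state loop under stated assumptions, not the project's code: genomes are vectors of floats in [0, 1], crossover is one-point, mutation is a small Gaussian nudge to one gene, k is even and at least 4, and the toy fitness function stands in for a ROUGE-S evaluation.

```python
import random


def steady_state_ga(fitness, genome_len, pop_size=20, k=4, steps=200, seed=0):
    """Tournament-based steady-state GA over vectors of floats in [0, 1]."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(steps):
        # Steps 1-2: pick k members and evaluate them.
        picked = rng.sample(range(pop_size), k)
        # Step 3: sort by fitness; best half are parents, worst half are replaced.
        picked.sort(key=lambda i: fitness(pop[i]), reverse=True)
        parents, replaced = picked[: k // 2], picked[k // 2:]
        # Step 4: copy parents to make children.
        children = [list(pop[p]) for p in parents]
        # Step 5: one-point crossover between the first two children, then mutation.
        if genome_len > 1:
            cut = rng.randrange(1, genome_len)
            children[0][cut:], children[1][cut:] = children[1][cut:], children[0][cut:]
        for child in children:
            g = rng.randrange(genome_len)
            child[g] = min(1.0, max(0.0, child[g] + rng.gauss(0, 0.1)))
        for slot, child in zip(replaced, children):
            pop[slot] = child
        # Step 6: loop back to step 1.
    return max(pop, key=fitness)


target = [0.2, 0.9, 0.5]  # made-up "ideal" weights for the toy fitness


def toy_fitness(weights):
    """Higher is better: negative squared distance to the target vector."""
    return -sum((a - b) ** 2 for a, b in zip(weights, target))


best = steady_state_ga(toy_fitness, genome_len=3)
```

Because only the losers of each tournament are overwritten, the best individual in the population is never lost between steps.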

16 Initial GA Evaluation
Document  TextRank  OntoRank  Ratio
1         0.0918    0.0952    1.0370
2         0.4095    0.3937    0.9612
3         0.2035    0.1991    0.9787
4         0.2687    0.2823    1.0506
5         0.0546    0.0588    1.0769
6         0.1778    0.2222    1.2500
7         0.3025    0.4034    1.3333
8         0.2507    0.2507    1.0000
9         0.1000    0.0952    0.9524
10        0.1685    0.1575    0.9348
AVG                           1.0575
The GA was run on a random subset of documents that scored below average with default settings, and was run until it provided a +5.75% gain over TextRank on the ROUGE-S scores.
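The +5.75% figure is the mean of the ten per-document ratios, not the ratio of the mean scores. A short check using only the numbers in the table (variable names are illustrative):

```python
# (TextRank, OntoRank, reported ratio) for the ten documents in the table.
rows = [
    (0.0918, 0.0952, 1.0370),
    (0.4095, 0.3937, 0.9612),
    (0.2035, 0.1991, 0.9787),
    (0.2687, 0.2823, 1.0506),
    (0.0546, 0.0588, 1.0769),
    (0.1778, 0.2222, 1.2500),
    (0.3025, 0.4034, 1.3333),
    (0.2507, 0.2507, 1.0000),
    (0.1000, 0.0952, 0.9524),
    (0.1685, 0.1575, 0.9348),
]

reported = [ratio for _, _, ratio in rows]
mean_ratio = sum(reported) / len(reported)
gain_percent = (mean_ratio - 1) * 100

# Ratios recomputed from the four-decimal scores differ slightly from the
# reported ones, presumably because the slide rounded the underlying scores.
recomputed = [onto / text for text, onto, _ in rows]
max_gap = max(abs(a - b) for a, b in zip(recomputed, reported))
```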

17 Combined Ranking: HurricaneAsObject vs. HurricaneAsEvent
Commonsense distinctions that differ from those in an ontology like WordNet.
HurricaneAsObject: “Hurricane Gilbert moved to the north …”
HurricaneAsEvent: “During Hurricane Gilbert many trees were …”

18 Combined Ranking: Many Gilberts but one hurricane topic …
Gilbert is an ambiguous word for Cyc
Yet the word's primary connections are topic related
Similar to human name association in context

19 EVALUATIONS
Initial GA scores showed a +5% improvement
Evaluation on the whole dataset
Shocking Revelation
Re-Evaluation

20 First Full evaluation
Performed full per-document evaluation on DUC-2002
Carried out detailed per-document review of relative performance using ROUGE-S

21 Disappointing full dataset performance

22 Debugging via Histogramming
Sorted the relative performance on a per-document basis
High variance, with average positive effect +15% and average negative effect -14%
Unfortunately more often negative than positive, so a net negative skew
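A sketch of this kind of per-document breakdown is shown below. The scores are toy values chosen for the example, not DUC-2002 results, and the function names are illustrative.

```python
def relative_effects(baseline, augmented):
    """Per-document relative change of the augmented score over the baseline."""
    return [(a - b) / b for b, a in zip(baseline, augmented) if b > 0]


def mean(values):
    return sum(values) / len(values) if values else 0.0


def summarize(effects):
    """Average gain, average loss, and how often each occurs."""
    gains = [e for e in effects if e > 0]
    losses = [e for e in effects if e < 0]
    return {
        "avg_gain": mean(gains),
        "avg_loss": mean(losses),
        "n_gain": len(gains),
        "n_loss": len(losses),
    }


# Toy scores, not DUC-2002 results.
baseline = [0.20, 0.40, 0.10, 0.25]
augmented = [0.23, 0.34, 0.09, 0.25]
stats = summarize(relative_effects(baseline, augmented))
```

Reporting the counts alongside the averages matters here: similar-sized gains and losses can still net out negative when losses are more frequent.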

23 Revelation
While working on a distributed version of TextRank, discovered that DUC-2002 contains two datasets
– The per-document generative summary
– The multi-document extractive summary
Of course the system was using the generative summary to evaluate an extractive system!
Convert and re-test on the multi-document dataset
No time to re-evolve using the GA for the multi-document data

24 Multi-document Re-Evaluation

25 Evaluation Conclusions
Much more encouraging when comparing same data types
Initial weakness prompted analysis of negative result leading to theory covered in discussion
No breakthrough

26 Discussion
Adding the commonsense graph produces wide variation in TextRank performance, both positive and negative.
– TextRank tries to preserve the total information present in a graph
– Adding commonsense to the graph can identify what a reader should be interested in as well as what they probably already know
– In the first case there is an improvement: disambiguation and context are selected
– In the second case you transmit redundant information (common sense) and reduce the effective bandwidth of the summary

27 Discussion
Identification of stopconcepts
– The ontology version of stopwords
– Nodes that have so much connectivity that they contain little information
– Created a stopconcepts list
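The slides do not say how the stopconcepts list was built. One simple possibility, offered purely as an assumption, is to flag nodes whose degree exceeds a threshold and drop their edges before ranking. The threshold, the function names, and the toy edges (including the terms Thing and WeatherEvent) are hypothetical.

```python
from collections import Counter


def find_stopconcepts(edges, max_degree):
    """Return nodes whose total degree (in + out) exceeds max_degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {node for node, deg in degree.items() if deg > max_degree}


def drop_stopconcepts(edges, stopconcepts):
    """Remove every edge that touches a stopconcept."""
    return [
        (s, t) for s, t in edges
        if s not in stopconcepts and t not in stopconcepts
    ]


# Toy edge list; a very general node collects links from everything.
edges = [
    ("HurricaneAsEvent", "Thing"),
    ("DominicanRepublic", "Thing"),
    ("JohnGilbert", "Thing"),
    ("HurricaneAsEvent", "WeatherEvent"),
]
stop = find_stopconcepts(edges, max_degree=2)
pruned = drop_stopconcepts(edges, stop)
```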

28 Future Work
Run the GA on the multi-document data set
Develop the ability to detect novel information from redundant information
The ontology ranking process itself is useful
– Ontological debugging
– Familiarization with the language of the ontology via a form of parallel text

29 Conclusions
Adding commonsense graphs to TextRank can affect the performance both positively and negatively
Need to identify how to modulate the effects of commonsense information
Having the right data helps!
Spin-offs for the text-to-ontology graph can be useful
