Hierarchical Summaries for Search. By: Dawn J. Lawrie, University of Massachusetts, Amherst.


1 Hierarchical Summaries for Search. By: Dawn J. Lawrie, University of Massachusetts, Amherst

2 The Problem

3 Possible Solution

4 Possible Solution

5 Solution: Automatic Hierarchies

6 Strengths of Automatic Hierarchies: Word-based summary; Focus on topics of the documents; Allows users to navigate through the results; Easy to understand; Bonus: useful for summarizing documents

7 Example: Hand-generated hierarchy of 50 documents. Query: “Endangered Species (Mammals)”. [Figure: hierarchy tree, layout not preserved; each node carries two counts as shown on the slide.] Nodes: Endangered Animals (29 10); marine mammals (18 8); legislation (6 4); permits (10 2); Critical Habitat (16 0); Endangered plants (7 0); Ecosystem Management (2 0); Threatened (1 0); Endangered Species Act (1 0); mammals (17 10); fish (7 0); birds (3 0); insects (3 0); amphibians (1 0); sea lions (2 2); manatees (1 1); whales (7 4); jaguars (2 0); marine (12 8); deer (1 1); habitat protection (1 1); rats (1 0); Hawaii (3 0); California (2 0); Utah (1 0); Virginia (1 0); Melicope Species (1 0); Waianae Plant Cluster Recovery Plan (1 0); Waianae Mountains (1 0)

8 Proposed Framework: Document Set -> Language Model -> Term Selection Algorithm -> Hierarchy. “Term” = word or phrase

9 Challenges: Selecting terms for the hierarchy; Displaying the hierarchy; Showing that it works

10 Outline: Introduction; Description of framework for creating hierarchies; Examples; Methods of evaluation; Future Improvements

11 Methodology: Build probabilistic word model of documents; Find “best” terms (on topic, predictive); Recursive definition creates hierarchy

12 Term characteristics. Why topicality? It distinguishes topic terms from the rest of the vocabulary. Example sentence: "The Secretary of Interior listed bald eagles south of the 40th parallel as endangered under the Endangered Species Preservation Act of 1966." Why predictiveness? Topic words can be strongly related, so predictive terms are chosen to represent different facets of the vocabulary. Example: P(“Endangered” | “Steller sea lions”) = 1.00

13 Statistical Model: A_T refers to topicality with respect to topic T (find if the word w is in set T); B refers to predictiveness, a precondition for other terms to occur (find if word w is in set P)

14 Probabilistic Word Model: Captures statistical information about text; Called a “language model” in speech recognition; Provides basis for estimation of probabilities

15 Estimating Topicality: Use term’s contribution to relative entropy; Compares two models using K-L divergence: a model of the documents in the hierarchy and a model of general English
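
The per-term contribution to relative entropy can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes unigram maximum-likelihood models, and the function name `kl_contributions`, the `epsilon` floor for words missing from the general-English model, and the toy token lists are all hypothetical. The slides do not say which smoothing was used.

```python
import math
from collections import Counter

def kl_contributions(doc_tokens, background_tokens, epsilon=1e-6):
    """Return each term's contribution p(w) * log(p(w) / q(w)) to KL(P || Q).

    P is the model of the documents in the hierarchy, Q the model of
    general English. `epsilon` is an assumed floor for words unseen in Q.
    """
    doc_counts = Counter(doc_tokens)
    bg_counts = Counter(background_tokens)
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    scores = {}
    for word, count in doc_counts.items():
        p = count / doc_total
        q = max(bg_counts[word] / bg_total, epsilon)
        scores[word] = p * math.log(p / q)
    return scores

# Toy data (hypothetical): retrieved documents vs. a tiny "general English" sample.
docs = "marine mammal species the the marine species endangered the".split()
english = "the the the the of of and a species the a of".split()
scores = kl_contributions(docs, english)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Words that are frequent in the document set but rare in general English rank highest, while a function word such as "the" gets a contribution near or below zero.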

16 KL Example: mammal, fishery, species, marine, endangered

17 Estimating Predictiveness: Relates the vocabulary to a set of candidate topic terms; Use conditional probability P_x(t | v), where x is the maximum distance between t and v. [Figure: P(t|v) values among the terms mammal, marine, species, and fishery; the pairing of values to term pairs is not preserved in the transcript]
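
One way to estimate P_x(t | v) from text is to count how often t appears within x words of an occurrence of v. The sketch below assumes single-word terms and a symmetric window. The function name `predictiveness` and the sample sentence are hypothetical, and the slides do not specify the estimator actually used.

```python
def predictiveness(tokens, t, v, x):
    """Estimate P_x(t | v): the fraction of occurrences of v that have t
    within x words on either side (assumed symmetric window)."""
    positions_v = [i for i, word in enumerate(tokens) if word == v]
    if not positions_v:
        return 0.0
    hits = 0
    for i in positions_v:
        window = tokens[max(0, i - x):i] + tokens[i + 1:i + x + 1]
        if t in window:
            hits += 1
    return hits / len(positions_v)

# Hypothetical sample text.
tokens = ("marine mammal protection covers every marine mammal "
          "species and fishery rules protect fish").split()
p_mammal_given_marine = predictiveness(tokens, "mammal", "marine", 2)
p_species_given_mammal = predictiveness(tokens, "species", "mammal", 2)
p_marine_given_fishery = predictiveness(tokens, "marine", "fishery", 2)
```

The estimate is asymmetric: here "marine" always has "mammal" nearby, while "mammal" has "species" nearby only half the time.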

18 Dominating Set Approximation: Interpret predictive language model as a graph with edges weighted by the conditional probability; Finds terms that are connected to lots of terms with a high weight; Chooses topic terms until vocabulary is dominated (predicted)
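
A greedy heuristic is a common way to approximate a dominating set, and the sketch below uses one to illustrate the idea. It is an assumption, not the algorithm from the talk: the function name `greedy_topic_terms`, the `threshold` that decides when a word counts as dominated, and the toy weights are all hypothetical.

```python
def greedy_topic_terms(cond_prob, max_terms=10, threshold=0.3):
    """Greedily pick topic terms until the vocabulary is dominated.

    cond_prob[t][v] holds the edge weight P(t | v) between candidate
    topic term t and vocabulary word v. A word counts as dominated once
    a chosen term is linked to it with weight >= threshold (assumed rule).
    """
    undominated = {v for edges in cond_prob.values() for v in edges}
    chosen = []

    def gain(term):
        # Total weight to words that are not yet dominated.
        return sum(w for v, w in cond_prob[term].items()
                   if v in undominated and w >= threshold)

    while undominated and len(chosen) < max_terms:
        candidates = [t for t in cond_prob if t not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # no remaining candidate dominates anything new
        chosen.append(best)
        undominated -= {v for v, w in cond_prob[best].items() if w >= threshold}
        undominated.discard(best)
    return chosen

# Toy weights (hypothetical, not the values from the slides).
cond_prob = {
    "species": {"mammal": 0.9, "fishery": 0.6, "marine": 0.5},
    "marine": {"mammal": 0.4, "fishery": 0.7, "habitat": 0.8},
    "mammal": {"marine": 0.35},
}
topics = greedy_topic_terms(cond_prob)
```

In this toy graph "species" is picked first because it is strongly connected to the most words, and "marine" is added only because it covers the remaining word "habitat".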

19 Term Selection Example [Figure: terms v and t linked by P(t|v); values not preserved in the transcript]

20 Generating a Summary: 4-step process: (1) Preprocess document set; (2) Generate a language model; (3) Select the terms; (4) Create a hierarchy (recursive)
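
The recursive step can be sketched as below: choose terms for a document set, then repeat the selection on the documents under each chosen term. To keep the example short, `select_terms` is a stand-in that ranks words by document frequency; it is not the topicality and predictiveness selection described in the talk. All names, the fan-out `k`, and the `depth` limit are hypothetical.

```python
from collections import Counter

def select_terms(docs, exclude, k):
    """Stand-in selector: the k words found in the most documents,
    skipping words already used higher up the path."""
    df = Counter(w for doc in docs for w in set(doc) if w not in exclude)
    ordered = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered[:k]]

def build_hierarchy(docs, exclude=frozenset(), k=2, depth=2):
    """Recursively build {term: {"docs": count, "children": subtree}}."""
    if depth == 0 or len(docs) < 2:
        return {}
    tree = {}
    for term in select_terms(docs, exclude, k):
        subset = [doc for doc in docs if term in doc]
        tree[term] = {
            "docs": len(subset),
            "children": build_hierarchy(subset, exclude | {term}, k, depth - 1),
        }
    return tree

# Hypothetical preprocessed documents (lists of tokens).
docs = [
    ["marine", "mammal", "whale"],
    ["marine", "mammal", "seal"],
    ["marine", "fish"],
    ["plant", "habitat"],
]
tree = build_hierarchy(docs)
```

Each level is built only from the documents that contain the parent term, so child terms describe sub-topics of the parent.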

21 Outline: Introduction; Description of framework for creating hierarchies; Examples; Methods of evaluation; Future Improvements

22 Example Hierarchies: Generated from 50 documents retrieved for the query “Endangered Species - Mammals”; Demonstrate the difference between using different topic models; Web hierarchy using same query

23 Uniform Topic Model Hierarchy [Figure: hierarchy tree, layout not preserved; each node carries two counts as shown on the slide.] Nodes: endangered (8 6); Act (4 1); State (3 2); Committee (4 3); address (8 5); operations (4 3); incidental take (4 2); NMFS (6 4); population (3 2); commercial fishing operations (4 2); regulations (12 4); fish (11 7); permit (14 6); number (9 3); bill (5 1); Secretary (7 3); research (10 5); amended (15 4); marine (18 7); species (43 9); plan (19 2); marine mammals (18 7)

24 KL-Topic Model Hierarchy [Figure: hierarchy tree, layout not preserved; each node carries two counts as shown on the slide.] Nodes: species (43 9); Marine Mammal Protection Act (7 3); marine mammals (18 7); management plan (5 1); marine (18 7); Endangered Species Act (29 4); endangered species (20 4); habitat (28 3); mammals (12 6); Marine Mammal Commission (2 1); fish (27 7); National Marine Fisheries Service (11 3); Act (31 3); permit (16 4); protection (24 4); marine mammal stocks (2 0); marine mammal species (4 2); fishery (5 3); Secretary (4 2); NMFS (8 3); stock (5 1); fish species (3 2); MMPA (5 1); incidental (7 4); research (6 3)

25 Query-Topic Model Hierarchy [Figure: hierarchy tree, layout not preserved; each node carries two counts as shown on the slide.] Nodes: marine mammal species (4 2); incidental (7 4); fish species (3 2); threatened species (1 0); requirements (8 4); population (6 2); MMPA (5 1); Protected Resources (2 1); marine mammal stocks (2 0); animals (4 2); habitat (28 3); endangered species (20 4); mammals (12 6); listed species (11 0); endangered species program (3 0); Act (31 3); National Marine Fisheries Service (11 3); Marine Mammal Protection Act (7 3); Wildlife Service (17 2); listed (26 0); Endangered Species Act (33 5); marine mammals (18 7); Federal (23 2); marine (18 7); species (43 9)

26 Web Hierarchies: Submit query to a web search engine; Gather titles and snippets of documents; The text of each title and snippet is treated as a document; Documents are about 30 words

27 Example of Web Hierarchy [Figure: hierarchy tree, layout not preserved; each node carries one count as shown on the slide.] Nodes: species of marine mammals (1); Listed Species (1); Species Information (1); Endangered Species Act (8); Protected Resources (2); sea otter (2); whales (13); dolphins (7); Cetaceans (2); marine (76); Mammals species (4); Canadian Endangered Species (3); federal Endangered Species (1); marine mammals (91); Endangered Mammals (22); threatened (144); species of mammals (27); endangered mammal species (4); birds (140); British mammals (4); animal species (1); Critically Endangered Mammals (2); Animal Info (2); Ecosystems (2); Scientists (2); species of marine mammals (1); Endangered Species Coalition (2); Endangered Spaces (2); List of Endangered Species (5); marine mammals (97); birds (114); Endangered Mammals (13); threatened (78); small mammals (13); large mammals (12); Endangered Species (440); endangered (491); mammals (600); terrestrial mammals (2); endangered marine species (2); Species Management (2); marine species (4); listing of species (1); protected species (2); native species (1); Candidate species (2); 100 species (1); new species (1)

28 Outline: Introduction; Description of framework for creating hierarchies; Examples; Methods of evaluation; Future Improvements

29 Evaluations: Summary Evaluation (tests how well the topic terms chosen predict the vocabulary); Access Evaluation (compares the number of documents a user can find); Relevance Evaluation (path length to find all relevant documents)

30 Automatic Evaluation Test Set: Use 50 standard queries. Document sets: 500 documents retrieved from TREC volumes 4 and 5 (have relevance judgments); 200 documents retrieved from a news database; 1000 titles and snippets retrieved using Google™ Search Engine

31 Evaluating Hypotheses (TREC Collection and News Documents) [Table: the hypotheses “Use KL-topic model” and “Use sub-collections” against the Summary, Access, and Relevance evaluations; each cell held a symbol denoting either that the evaluation confirmed the hypothesis or that it showed no significant difference; the symbols are not preserved in the transcript]

32 Web Document Evaluation: Results are completely different; The best hierarchy uses the uniform topic model; Hierarchies do not look as good to human inspection

33 User Study: Include 12 to 16 users; Compare ranked list and hierarchy to ranked list alone; Users asked to find all instances that are relevant to the query; Only have to identify one document about a particular instance; Study includes 10 queries

34 Future Work: Complete user study; Failure analysis; Explore the use of topic hierarchies in other organizational tasks (personal collections of documents, e-mails)

35 Conclusions: Developed a formal framework for topic hierarchies; Created hierarchies from full text and snippets of documents; Verified intuition concerning hierarchies generated from full text

36 Questions? Demo: http://www-ciir.cs.umass.edu/~lawrie/categories/google-qry/

37 Improving Topicality Estimate: Estimate topicality using a query model (emphasizes query-related terms); Improve model of English with sub-collections (distinguishes between terms that are frequently used in a genre and topic terms)

38 Key Ideas: Language models are created from documents; Topicality and predictiveness are used to choose terms; Topicality is estimated using Kullback-Leibler divergence; Predictiveness is estimated by calculating conditional probabilities

39 Key Ideas: Showed the effect of using topic model; Observed trade-off between snippets and full text of documents

40 Summary Evaluation: Uses the Expected Mutual Information Measure (EMIM), which measures how far two sets deviate from stochastic independence; Shows how well the topic terms chosen predict the vocabulary
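
For reference, the standard definition of expected mutual information between a set of topic terms T and the vocabulary V is sketched below. The symbols t, v, T, and V are notation assumed here, and the slide does not show how the joint and marginal probabilities were estimated.

```latex
I(T; V) = \sum_{t \in T} \sum_{v \in V} P(t, v) \, \log \frac{P(t, v)}{P(t)\, P(v)}
```

The sum is zero when every pair is independent, since then P(t, v) = P(t) P(v), and it grows as the topic terms become more informative about the vocabulary.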

41 Hierarchy Terms vs. Top TF.IDF Terms: TF.IDF is a popular term weight, commonly used as a method of naming clusters; Compare an equal number of unique terms in the hierarchy to the top TF.IDF terms. Results: Hierarchies always significantly better at summarizing documents

42 Access Evaluation: Compare number of documents that are accessible. Example policy: Examine parts of the hierarchy with 20 or fewer documents; Look at the top 200 documents in a ranked list
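
The example policy can be sketched as a comparison of two document sets: those reachable through small hierarchy nodes and those within a rank cutoff. This is an illustration under assumptions: the mapping of node names to document ids, the function names, and the toy data are hypothetical, and the slide gives no details beyond the two thresholds.

```python
def accessible_via_hierarchy(node_docs, max_node_size=20):
    """Documents reachable by reading only nodes with at most
    max_node_size documents. node_docs maps node name -> document ids."""
    reachable = set()
    for docs in node_docs.values():
        if len(docs) <= max_node_size:
            reachable |= set(docs)
    return reachable

def accessible_via_ranked_list(ranking, cutoff=200):
    """Documents a user sees when reading the ranked list down to cutoff."""
    return set(ranking[:cutoff])

# Toy data (hypothetical document ids).
node_docs = {
    "marine": list(range(30)),  # too large to read under the policy
    "whales": [1, 2, 3],
    "permits": [3, 40],
}
ranking = [5, 1, 40, 7, 2]
from_hierarchy = accessible_via_hierarchy(node_docs)
from_ranking = accessible_via_ranked_list(ranking, cutoff=3)
```

Comparing the sizes of the two sets, or their overlap with the relevant documents, gives one number per access method for a query.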

43 Hierarchy vs. Ranked List [Figure: four panels (Level Size = 5, 10, 15, and 20) comparing hierarchies with 5 to 50 topics against ranked-list cutoffs between rank 50 and rank 500; the plotted ordering is not preserved in the transcript]

44 Relevance Evaluation: Calculate average path to a relevant document. Assumptions: one does not read extraneous menus; one reads all documents at a node. Ignores relevant documents that are not in the hierarchy; Smaller score denotes a better hierarchy

45 Key Ideas: Automatic evaluations have confirmed hypotheses for hierarchies created from full text of documents; User study is necessary for determining how well people can use hierarchies

