Filtering Semi-Structured Documents Based on Faceted Feedback Lanbo Zhang, Yi Zhang, Qianli Xing Information Retrieval and Knowledge Management (IRKM)

1 Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz

2 Personalized Information Filtering
Identify user-desired documents from a document stream
Two families of filtering approaches
– Collaborative Filtering (CF)
– Content-Based Filtering (CBF)
Applications: news feeder, spam filter, etc.
[Diagram: input streams (News, Blogs, Emails, …) feed a Filtering System, which outputs the passed documents]

3 Semi-Structured Documents
Increasingly prevalent on the Internet
– Emails, news, movies, tweets, etc.
Plenty of metadata available

4 Definitions
Facet: a metadata field
– Date, Topic, Location, Director, Genre, etc.
Facet-Value Pair (FVP): a metadata field assigned a particular value
– Topic: Royal wedding
– Date:
– Location: London, UK
Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz

5 Motivation
Existing filtering approaches learn user interests from users' relevance judgments of documents
Users may have prior knowledge of which facet-value pairs are relevant
– English-only readers → Language: English
– Social network analysts → Company: Facebook

6 Can we exploit users' prior knowledge of facet-value pairs for filtering?

7 A New User Interaction Mechanism: Faceted Feedback
[Diagram: the Filtering System presents FVP candidates (Lang: …, Topic: …, Date: …) and the user returns the relevant FVPs (Topic: …, Lang: …)]

8 Research Questions
Question 1
– How to select facet-value pair candidates?
Question 2
– How to learn user profiles based on faceted feedback?

9 Q1: Possible Methods
Feature selection methods for text classification
– E.g., Mutual Information, Chi-Square measure, etc.
– Usually a large number of labeled documents available
Query expansion methods for retrieval
– E.g., TFIDF score on pseudo relevant documents
– No labeled documents available

10 FVP Selection: Our Approach
In a filtering task
– A large number of unlabeled documents
– Possibly a small number of labeled documents
We rank facet-value pairs by [ranking formula not preserved in this transcript; its annotated terms are the pseudo relevant (positively classified) documents and the user-labeled relevant documents]
Intuition: features that occur frequently among relevant docs while rarely in the whole corpus are very likely to be relevant
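
The intuition on this slide can be illustrated with a small sketch. The score used here (relative frequency among relevant documents times a smoothed log inverse corpus frequency) is an assumption chosen to match the stated intuition; it is not the slide's own ranking formula, and `rank_fvps` is a hypothetical name.

```python
from collections import Counter
from math import log

def rank_fvps(relevant_docs, corpus_docs, top_k=5):
    """Rank facet-value pairs by how concentrated they are in relevant docs.

    Each document is a set of (facet, value) tuples. The scoring rule is
    an illustrative assumption, not the formula from the slide.
    """
    n_rel = len(relevant_docs)
    n_all = len(corpus_docs)
    rel_counts = Counter(fvp for doc in relevant_docs for fvp in doc)
    all_counts = Counter(fvp for doc in corpus_docs for fvp in doc)
    scores = {}
    for fvp, count in rel_counts.items():
        freq_in_relevant = count / n_rel
        # Add-one smoothing keeps the ratio finite and non-negative.
        rarity = log((n_all + 1) / (all_counts[fvp] + 1))
        scores[fvp] = freq_in_relevant * rarity
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy corpus: relevant_docs stands in for the pseudo relevant and
# user-labeled relevant documents mentioned on the slide.
d1 = {("Topic", "Royal wedding"), ("Language", "English")}
d2 = {("Topic", "Royal wedding"), ("Language", "English")}
d3 = {("Topic", "Sports"), ("Language", "English")}
d4 = {("Topic", "Finance"), ("Language", "English")}
ranked = rank_fvps([d1, d2], [d1, d2, d3, d4])
```

In this toy corpus "Language: English" appears in every document, so it scores zero and ranks below "Topic: Royal wedding", which is frequent among relevant documents but rare overall.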

11 Research Questions
Question 1
– How to select facet-value pair candidates?
Question 2
– How to learn user profiles based on faceted feedback?

12 Content-Based Filtering (CBF)
Treated as a binary text classification task
User profile: a feature vector that represents a user's information needs (interests/preferences)
Given the user profile θ, a document can be determined as relevant or not according to: [equation not preserved in this transcript; its annotated terms are the document vector and the document label]
The core of CBF is learning the user profile!
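
As a rough illustration of classifying a document with a profile vector θ, the sketch below assumes a logistic model, P(relevant | x, θ) = sigmoid(θ · x). The logistic form, the 0.5 threshold, and the function names are assumptions made for illustration; the transcript does not preserve the slide's actual decision rule.

```python
import math

def relevance_probability(theta, x):
    """P(relevant | x, theta) under an assumed logistic model."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-score))

def is_relevant(theta, x, threshold=0.5):
    """Deliver the document if its estimated relevance reaches the threshold."""
    return relevance_probability(theta, x) >= threshold

# Hypothetical two-feature profile: feature 0 signals relevance,
# feature 1 signals non-relevance.
theta = [2.0, -1.0]
deliver_first = is_relevant(theta, [1, 0])
deliver_second = is_relevant(theta, [0, 1])
```

Under this assumed model, learning the user profile amounts to choosing θ so that these probabilities agree with the available feedback.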

13 Q2: Possible Methods
Simple methods
– Boolean strategy (AND, OR)
– Feature selection
– Pseudo relevant document
Sophisticated methods
– Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
– Generalized Expectation Criteria (Druck et al. 08)

14 Our Approach
The assumption
– A user selects a feature because it has a high correlation with the document label (R/NR)
Generalized Constraint Model (GCM)

15 Correlation Decomposition
Sufficiency
– The probability of a document being relevant given that the feature has occurred: P(R+|f=1)
– P(R+|f=1) = 1: sufficient features
– E.g., Company: Facebook for social network analysts
Necessity
– The probability of the feature having occurred given that a document is relevant: P(f=1|R+)
– P(f=1|R+) = 1: necessary features
– E.g., Language: English for English-only readers
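
The two quantities can be computed empirically from labeled documents, as in the minimal sketch below. The data layout (a list of `(features, label)` pairs) and the toy documents are assumptions for illustration only.

```python
def sufficiency(docs, feature):
    """Empirical P(R+ | f=1): share of docs containing the feature that are relevant."""
    covered = [label for features, label in docs if feature in features]
    return sum(covered) / len(covered) if covered else 0.0

def necessity(docs, feature):
    """Empirical P(f=1 | R+): share of relevant docs that contain the feature."""
    relevant = [features for features, label in docs if label]
    if not relevant:
        return 0.0
    return sum(feature in features for features in relevant) / len(relevant)

# Toy labeled documents: (set of facet-value pairs, is_relevant).
docs = [
    ({"Company: Facebook", "Language: English"}, True),
    ({"Language: English"}, True),
    ({"Language: English"}, False),
    ({"Language: French"}, False),
]
facebook_sufficiency = sufficiency(docs, "Company: Facebook")
english_necessity = necessity(docs, "Language: English")
```

In this toy data "Company: Facebook" is fully sufficient but only half necessary, while "Language: English" is fully necessary but only two-thirds sufficient.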

16 Examples: Highly-Correlated Features
[Venn diagram: within the whole corpus, the relevant set R+ and the regions f1=1, f2=1, f3=1]
1) f1 is a sufficient feature since P(R+|f1=1) = 1
2) f2 is a necessary feature since P(f2=1|R+) = 1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)

17 Estimating Sufficiency
[Equation not preserved in this transcript; its annotated terms are the document label, the feature, the set of documents covered by feature f, the user profile vector, and the estimation of the label of document d_i]

18 Estimating Necessity
Bayes' theorem!
[Equation not preserved in this transcript; its annotated terms are the feature sufficiency and the prior distribution]
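
Bayes' theorem relates necessity to sufficiency as follows. This is the standard identity only; how the slide estimates the two marginal terms is not preserved here, so no estimator is implied.

```latex
P(f{=}1 \mid R^{+})
  = \frac{P(R^{+} \mid f{=}1)\, P(f{=}1)}{P(R^{+})}
```

For example, with sufficiency P(R+|f=1) = 0.8, feature frequency P(f=1) = 0.1, and relevance rate P(R+) = 0.2, the necessity is 0.8 × 0.1 / 0.2 = 0.4.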

19 Reference Distributions
Our assumption
– A user selects a feature because it has a high sufficiency and/or a high necessity
Reference distributions: two Bernoulli distributions
– The sufficiency/necessity of a user-selected feature should be close to the reference distribution
– KL-divergence as the similarity measure
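
A minimal sketch of the KL-divergence between two Bernoulli distributions follows. The argument order (reference first, estimate second), the clipping constant, and the example values are assumptions; the transcript does not show which direction or reference parameters the slide uses.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical reference success probability of 0.9 compared with two
# estimated sufficiency values.
far = bernoulli_kl(0.9, 0.5)
near = bernoulli_kl(0.9, 0.8)
```

The divergence shrinks as the estimated value approaches the reference, which is what makes it usable as a penalty that pulls a selected feature's sufficiency or necessity toward the reference distribution.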

20 User Profile Learning
The unified loss function to combine two types of feedback: [equation not preserved in this transcript; its annotated terms cover the user-labeled documents, the necessary features, and the sufficient features]
T_s, T_n: reference distributions

21 User Interaction Mechanisms
Two mechanisms
– Mechanism 1: ask users to select features they think are relevant
– Mechanism 2: ask users to specifically select features they think are sufficient and necessary respectively

22 Outline
Introduction
Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
Experiments
– Settings
– Results
Summary

23 Data Sets
Use two data sets from the TREC filtering track
– TREC 2000: OHSUMED (medical articles) + 63 topics (information needs)
– Metadata field: MeSH (Medical Subject Headings)
– TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
– Metadata fields: Topic, Industry, Region
Split each topic set into two equal-size subsets
– One for parameter tuning, the other for testing

24 Faceted Feedback Collection
Recruit subjects on Mechanical Turk
– Five subjects per topic
– The average performance is reported
For each topic, we show subjects
– The topic description (information need)
– A group of facet-value pair candidates

25 Evaluation Metrics
Precision (macro)
Recall (macro)
T11U = 2 * N_rd - N_nd
– N_rd: the number of relevant docs delivered
– N_nd: the number of non-relevant docs delivered
T11SU = [equation not preserved in this transcript]
– MinNU = -0.5
– MaxU: the maximum possible utility (T11U)
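
The utility metrics can be sketched as below. T11U follows the slide directly. The T11SU normalization is an assumption based on the usual TREC 2002 (TREC-11) filtering track definition, T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU) with MaxU = 2 × (number of relevant docs), because the slide's own formula is not preserved in this transcript.

```python
def t11u(n_relevant_delivered, n_nonrelevant_delivered):
    """Linear utility from the slide: 2 * N_rd - N_nd."""
    return 2 * n_relevant_delivered - n_nonrelevant_delivered

def t11su(n_relevant_delivered, n_nonrelevant_delivered,
          n_relevant_total, min_nu=-0.5):
    """Scaled utility in [0, 1], assuming the TREC-11 normalization.

    n_relevant_total must be positive; MaxU is the utility of delivering
    every relevant document and nothing else.
    """
    max_u = 2 * n_relevant_total
    normalized = t11u(n_relevant_delivered, n_nonrelevant_delivered) / max_u
    # Flooring at min_nu bounds the penalty for a very poor topic.
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)

# Hypothetical topic with 10 relevant documents in the stream.
utility = t11u(10, 4)
scaled = t11su(10, 4, 10)
```

Under this normalization a system that delivers nothing scores 1/3, a perfect system scores 1, and any system at or below the MinNU floor scores 0.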

26 Outline
Introduction
Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
Experiments
– Settings
– Results
Summary

27 Results 1: With/Without Faceted Feedback (FF)
[Plot: the x-axis is the number of relevant docs initially known]
Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

28 Results 2: Different Learning Algorithms
[Plot legend: our approach, existing approaches]
– BOOL(A), BOOL(O): Boolean strategy
– FS: feature selection based on FF
– Pseudo-D/Q: pseudo relevant doc/query
– Prior: logistic regression with Bayesian prior
– GEC: generalized expectation criteria

29 Outline
Introduction
Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
Experiments
– Settings
– Results
Summary

30 Summary
Faceted feedback is useful for filtering, especially in cold-start scenarios
The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
In future work, we will evaluate our methods on data sets where faceted features are more important
– Movie, music, product, etc.

31 Questions?

