Filtering Semi-Structured Documents Based on Faceted Feedback Lanbo Zhang, Yi Zhang, Qianli Xing Information Retrieval and Knowledge Management (IRKM)

1 Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz

2 Outline
Introduction
Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
Experiments
– Settings
– Results
Summary

3 Personalized Information Filtering
Identify user-desired documents from a document stream
Two families of filtering approaches
– Collaborative Filtering (CF)
– Content-Based Filtering (CBF)
Applications: news feeder, email spam filter, etc.
[Figure: a Filtering System takes in News, Blogs, and Emails and outputs the passed documents]

4 Semi-Structured Documents
Increasingly prevalent over the Internet
Emails, news, movies, tweets, etc.
Plenty of metadata available

5 Definitions
Facet: a metadata field
– Date, Topic, Location, Director, Genre, etc.
Facet-Value Pair (FVP): a metadata field assigned a particular value
– Topic: Royal wedding
– Date: 04-29-2011
– Location: London, UK
Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz
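
A minimal sketch of how these definitions might look in code. The `FacetValuePair` type, the `has_fvp` helper, and the sample document are illustrative assumptions, not something taken from the slides.

```python
from typing import NamedTuple


class FacetValuePair(NamedTuple):
    """A metadata field (facet) paired with one concrete value."""
    facet: str
    value: str


# Hypothetical semi-structured document: free text plus facet-value pairs.
document = {
    "text": "Crowds gather in London for the royal wedding.",
    "fvps": {
        FacetValuePair("Topic", "Royal wedding"),
        FacetValuePair("Date", "04-29-2011"),
        FacetValuePair("Location", "London, UK"),
    },
}


def has_fvp(doc: dict, facet: str, value: str) -> bool:
    """Return True if the document carries the given facet-value pair."""
    return FacetValuePair(facet, value) in doc["fvps"]
```

Storing FVPs as a set makes the membership check a single hash lookup, which is the operation the later feature definitions (f=1 or f=0) rely on.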

6 Motivation
Existing filtering approaches learn user interests from users' relevance judgments of documents
Users may have prior knowledge of which facet-value pairs are relevant
– English-only readers → Language: English
– Social network analysts → Company: Facebook

7 Can we exploit users' prior knowledge of facet-value pairs for filtering?

8 A New User Interaction Mechanism: Faceted Feedback
[Figure: the Filtering System presents FVP candidates (Lang: …, Topic: …, Date: …) and the user returns the relevant FVPs (Topic: …, Lang: …)]

9 Research Questions
Question 1
– How to select facet-value pair candidates?
Question 2
– How to learn user profiles based on faceted feedback?

10 FVP Selection: Our Approach
In a filtering task
– A large number of unlabeled documents
– Possibly a small number of labeled documents
We rank facet-value pairs by [equation not shown], whose annotated terms are:
– Pseudo-relevant (positively classified) documents
– User-labeled relevant documents
Intuition: features that occur frequently among relevant docs but rarely in the whole corpus are very likely to be relevant
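
The ranking formula itself is not preserved in this transcript, so the following is only a sketch of the stated intuition. It assumes a simple ratio score (relative frequency among relevant documents divided by smoothed relative frequency in the corpus); the function name, the score, and the smoothing constant are all hypothetical and may differ from the authors' actual formula.

```python
from collections import Counter


def rank_fvps(relevant_docs, corpus_docs, top_k=10, smoothing=1.0):
    """Rank FVPs by (frequency among relevant docs) / (frequency in corpus).

    Each document is a set of (facet, value) tuples. `relevant_docs` may mix
    user-labeled and pseudo-relevant documents. The ratio score and additive
    smoothing are illustrative assumptions, not the slides' formula.
    """
    if not relevant_docs:
        return []
    rel_counts = Counter(fvp for doc in relevant_docs for fvp in doc)
    corpus_counts = Counter(fvp for doc in corpus_docs for fvp in doc)
    n_rel, n_all = len(relevant_docs), len(corpus_docs)

    def score(fvp):
        p_rel = rel_counts[fvp] / n_rel
        p_all = (corpus_counts[fvp] + smoothing) / (n_all + smoothing)
        return p_rel / p_all

    # Highest score first; ties broken by the FVP itself for determinism.
    ranked = sorted(rel_counts, key=lambda fvp: (-score(fvp), fvp))
    return ranked[:top_k]
```

Under this score, a pair such as Lang: en that appears in every document ranks below a pair that is concentrated in the relevant set, which matches the intuition on the slide.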

12 Content-Based Filtering (CBF)
Treated as a binary text classification task
User profile: a feature vector that represents a user's information needs (interests/preferences)
Given the user profile θ, a document can be determined as relevant or not according to: [equation not shown; its annotated terms are the document vector and the document label]
The core of CBF is learning the user profile!
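
The decision rule on this slide is not preserved in the transcript. As a sketch only, the code below assumes a logistic model over the document vector, a common choice for binary text classification; the model form, the function names, and the 0.5 threshold are assumptions rather than the authors' stated method.

```python
import math


def relevance_probability(theta, x):
    """P(relevant | x, theta) under an assumed logistic model."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # Numerically stable sigmoid for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_relevant(theta, x, threshold=0.5):
    """Deliver the document when its relevance probability reaches the threshold."""
    return relevance_probability(theta, x) >= threshold
```

Here `theta` plays the role of the user profile and `x` the document vector, so learning the profile amounts to fitting `theta`.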

13 Our Approach
The assumption
– A user selects a feature because it is highly correlated with the document label (R/NR)
Generalized Constraint Model (GCM)

14 Correlation Decomposition
Sufficiency
– The probability of a document being relevant given that the feature has occurred: P(R+ | f=1)
– P(R+ | f=1) = 1: sufficient features
– E.g., Company: Facebook for social network analysts
Necessity
– The probability of the feature having occurred given that a document is relevant: P(f=1 | R+)
– P(f=1 | R+) = 1: necessary features
– E.g., Language: English for English-only readers

15 Examples: Highly-Correlated Features
[Figure: the whole corpus, the relevant set R+, and the sets of documents with f1=1, f2=1, and f3=1]
1) f1 is a sufficient feature since P(R+ | f1=1) = 1
2) f2 is a necessary feature since P(f2=1 | R+) = 1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)
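
The three cases can be checked on a toy labeled set. This sketch computes sufficiency and necessity as plain empirical frequencies; the function names and the representation of documents as feature sets are illustrative assumptions.

```python
def sufficiency(docs, labels, feature):
    """Empirical P(R+ | f=1): share of docs containing the feature that are relevant."""
    covered = [bool(y) for d, y in zip(docs, labels) if feature in d]
    return sum(covered) / len(covered) if covered else 0.0


def necessity(docs, labels, feature):
    """Empirical P(f=1 | R+): share of relevant docs that contain the feature."""
    relevant = [d for d, y in zip(docs, labels) if y]
    return sum(feature in d for d in relevant) / len(relevant) if relevant else 0.0


# Toy corpus: f1 occurs only in relevant docs, f2 occurs in every relevant doc,
# and f3 overlaps heavily with the relevant set without containing or covering it.
docs = [
    {"f1", "f2", "f3"}, {"f1", "f2", "f3"}, {"f2", "f3"}, {"f2"},  # relevant
    {"f2", "f3"}, {"f2"}, set(),                                    # non-relevant
]
labels = [1, 1, 1, 1, 0, 0, 0]
```

On this data `sufficiency(docs, labels, "f1")` and `necessity(docs, labels, "f2")` both reach 1, while f3 stays below 1 on both measures.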

16 Estimating Sufficiency
[Equation not shown; its annotated terms are the document label, the feature, the set of documents covered by feature f, the user profile vector, and the estimated label of document d_i]
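
The estimator on this slide is not preserved. One reading consistent with its annotations is to average the profile's predicted relevance over the documents covered by the feature; the sketch below assumes that reading and a logistic prediction model, both of which are assumptions rather than the authors' confirmed formula.

```python
import math


def predicted_relevance(theta, x):
    """Assumed logistic estimate of the label of a document with vector x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def estimated_sufficiency(theta, doc_vectors, feature_index):
    """Average predicted relevance over the documents covered by the feature.

    A document is "covered" when its entry at feature_index is non-zero.
    Returns 0.0 when no document is covered (an arbitrary convention).
    """
    covered = [x for x in doc_vectors if x[feature_index] > 0]
    if not covered:
        return 0.0
    return sum(predicted_relevance(theta, x) for x in covered) / len(covered)
```

Because the estimate depends on `theta`, it can be used inside a training objective even when the covered documents have no user labels.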

17 Estimating Necessity
[Equation not shown; necessity is obtained from the feature sufficiency and the prior distribution via Bayes' theorem]
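
The slide's own equation is missing from the transcript. Applying Bayes' theorem to the definitions of sufficiency and necessity given earlier yields the relation below; reading P(f=1) as the "prior distribution" annotation is an assumption.

```latex
P(f = 1 \mid R^{+})
  = \frac{P(R^{+} \mid f = 1)\, P(f = 1)}{P(R^{+})}
```

Here P(R+ | f=1) is the feature sufficiency, P(f=1) is the fraction of documents in which the feature occurs, and P(R+) is the overall probability that a document is relevant.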

18 Reference Distributions
Our assumption
– A user selects a feature because it has a high sufficiency and/or a high necessity
Reference distributions: two Bernoulli distributions
– The sufficiency/necessity of a user-selected feature should be close to the reference distribution
– KL-divergence as the similarity measure
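
A minimal sketch of the KL-divergence between two Bernoulli distributions, each described by its success probability. The argument order (reference first, estimate second) and the clipping constant are assumptions; the slides do not say which direction is used.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), clipping both arguments to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
```

With a hypothetical reference probability of 0.9, an estimated sufficiency of 0.8 is penalized less than one of 0.5, and the penalty vanishes when the two agree.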

19 User Profile Learning
The unified loss function to combine two types of feedback: [equation not shown; its annotated terms cover the user-labeled documents, the sufficient features, and the necessary features]
T_s, T_n: reference distributions
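
The loss function itself is not preserved in the transcript. The sketch below shows one plausible shape for such an objective under stated assumptions: log loss on user-labeled documents, plus KL penalties that pull the estimated sufficiency and necessity of user-selected features toward reference probabilities. The logistic model, the weights `lam_s` and `lam_n`, the reference values of 0.9, and every function name are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)) with clipping."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def unified_loss(theta, labeled, unlabeled, sufficient, necessary,
                 t_s=0.9, t_n=0.9, lam_s=1.0, lam_n=1.0):
    """Assumed objective: document log loss + sufficiency and necessity penalties.

    labeled: list of (x, y) with y in {0, 1}; unlabeled: non-empty list of x.
    sufficient / necessary: indices of features the user selected.
    """
    def prob(x):
        return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

    # Term 1: log loss on user-labeled documents.
    loss = 0.0
    for x, y in labeled:
        p = min(max(prob(x), 1e-12), 1.0 - 1e-12)
        loss -= math.log(p) if y else math.log(1.0 - p)

    probs = [prob(x) for x in unlabeled]
    p_rel = sum(probs) / len(probs)  # estimated P(R+)

    def suff(f):
        covered = [p for x, p in zip(unlabeled, probs) if x[f] > 0]
        return sum(covered) / len(covered) if covered else 0.0

    def nec(f):
        # Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+).
        p_f = sum(1 for x in unlabeled if x[f] > 0) / len(unlabeled)
        return suff(f) * p_f / p_rel if p_rel > 0 else 0.0

    # Terms 2 and 3: keep selected features close to the reference distributions.
    loss += lam_s * sum(bernoulli_kl(t_s, suff(f)) for f in sufficient)
    loss += lam_n * sum(bernoulli_kl(t_n, nec(f)) for f in necessary)
    return loss
```

A profile that agrees with both the labeled documents and the selected features should score lower than one that contradicts them, so this function could be handed to any generic minimizer.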

20 User Interaction Mechanisms
Two mechanisms
– Mechanism 1: ask users to select features they think are relevant
– Mechanism 2: ask users to specifically select features they think are sufficient and necessary, respectively

22 Data Sets
Two data sets from the TREC filtering track
– TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)
  Metadata field: MeSH (Medical Subject Headings)
– TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
  Metadata fields: Topic, Industry, Region
Split each topic set into two equal-size subsets
– One for parameter tuning, the other for testing

23 Faceted Feedback Collection
Recruit subjects on Mechanical Turk
– Five subjects per topic
– The average performance is reported
For each topic, we show subjects
– The topic description (information need)
– A group of facet-value pair candidates

24 Evaluation Metrics
Precision (macro)
Recall (macro)
T11U = 2 * N_rd - N_nd
– N_rd: the number of relevant docs delivered
– N_nd: the number of non-relevant docs delivered
T11SU = [equation not shown], with
– MinNU = -0.5
– MaxU: the maximum possible utility (T11U)
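
The two utility metrics can be sketched as follows. The T11SU formula is missing from this transcript; the code assumes the TREC 2002 filtering-track definition, T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU), with MaxU taken as 2 times the total number of relevant documents. Function and parameter names are illustrative.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility in [0, 1], assuming the TREC 2002 definition.

    n_rel_total must be positive; MaxU is the utility of delivering every
    relevant document and nothing else.
    """
    max_u = 2 * n_rel_total
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(normalized, min_nu) - min_nu) / (1.0 - min_nu)
```

Under this definition a system that delivers nothing scores 1/3, a perfect system scores 1, and the floor at MinNU keeps one very bad topic from dominating the average.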

26 Results 1: with/without Faceted Feedback (FF)
[Figure: results chart; axis label: # relevant docs initially known]
Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

28 Summary
Faceted feedback is useful for filtering, especially in cold-start scenarios
The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
In future work, we will evaluate our methods on data sets where faceted features are more important
– Movie, music, product, etc.

29 Questions?
lanbo@soe.ucsc.edu
yiz@soe.ucsc.edu
xingqianli@gmail.com

