Filtering Semi-Structured Documents Based on Faceted Feedback Lanbo Zhang, Yi Zhang, Qianli Xing Information Retrieval and Knowledge Management (IRKM)

Presentation transcript:

Filtering Semi-Structured Documents Based on Faceted Feedback Lanbo Zhang, Yi Zhang, Qianli Xing Information Retrieval and Knowledge Management (IRKM) Lab University of California, Santa Cruz

Personalized Information Filtering
Identify user-desired documents from a document stream
Two families of filtering approaches
  – Collaborative Filtering (CF)
  – Content-Based Filtering (CBF)
Applications: news feeds, spam filters, etc.
[Diagram: News, Blogs, Emails → Filtering System → Passed documents …]

Semi-Structured Documents
Increasingly prevalent on the Internet
  – Emails, news, movies, tweets, etc.
Plenty of metadata available

Definitions
Facet: a metadata field
  – Date, Topic, Location, Director, Genre, etc.
Facet-Value Pair (FVP): a metadata field assigned a particular value
  – Topic: Royal wedding
  – Date:
  – Location: London, UK
Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz

Motivation
Existing filtering approaches learn user interests from users' relevance judgments of documents
Users may have prior knowledge of which facet-value pairs are relevant
  – English-only readers → Language: English
  – Social network analysts → Company: Facebook

Can we exploit users' prior knowledge of facet-value pairs for filtering?

A New User Interaction Mechanism: Faceted Feedback
[Diagram: the Filtering System presents FVP candidates (Lang: …, Topic: …, Date: …); the user returns the relevant FVPs (Topic: …, Lang: …)]

Research Questions
Question 1
  – How to select facet-value pair candidates?
Question 2
  – How to learn user profiles based on faceted feedback?

Q1: Possible Methods
Feature selection methods for text classification
  – E.g., Mutual Information, Chi-Square measure, etc.
  – Usually a large number of labeled documents available
Query expansion methods for retrieval
  – E.g., TFIDF score on pseudo relevant documents
  – No labeled documents available

FVP Selection: Our Approach
In a filtering task
  – A large number of unlabeled documents
  – Possibly a small number of labeled documents
We rank facet-value pairs by a score computed over
  – Pseudo relevant (positively classified) documents
  – User-labeled relevant documents
Intuition: features that occur frequently among relevant docs but rarely in the whole corpus are very likely to be relevant
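
The ranking formula itself is not reproduced in this transcript. The sketch below only illustrates the stated intuition with a hypothetical TF-IDF-style score: the fraction of relevant (user-labeled or pseudo relevant) documents containing a facet-value pair, weighted by how rare the pair is in the whole corpus. The function name `rank_fvps`, the smoothing, and the representation of a document as a set of `(facet, value)` tuples are all assumptions made for illustration, not the authors' scoring function.

```python
import math
from collections import Counter


def rank_fvps(relevant_docs, corpus_docs, top_k=10):
    """Rank facet-value pairs (FVPs) by a hypothetical relevance score.

    Each document is a set of (facet, value) tuples. `relevant_docs` may mix
    user-labeled and pseudo relevant documents (an assumption of this sketch).
    """
    if not relevant_docs:
        return []
    # Document frequencies: each doc is a set, so a pair counts once per doc.
    rel_df = Counter(fvp for doc in relevant_docs for fvp in doc)
    corpus_df = Counter(fvp for doc in corpus_docs for fvp in doc)
    n_rel, n_corpus = len(relevant_docs), len(corpus_docs)

    scores = {}
    for fvp, df in rel_df.items():
        rel_freq = df / n_rel  # frequent among relevant docs
        # Add-one smoothed IDF: rare in the whole corpus gives a larger weight.
        idf = math.log((n_corpus + 1) / (corpus_df[fvp] + 1))
        scores[fvp] = rel_freq * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

With this particular score, a pair that appears in every document of the corpus (for example a language tag shared by all documents) receives a weight of zero, so it would be ranked last even if it is relevant to the user.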

Research Questions
Question 1
  – How to select facet-value pair candidates?
Question 2
  – How to learn user profiles based on faceted feedback?

Content-Based Filtering (CBF)
Treated as a binary text classification task
User profile: a feature vector that represents a user's information needs (interests/preferences)
Given the user profile θ, whether a document is relevant (the document label) is determined from its document vector
The core of CBF is learning the user profile!
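
The decision rule on this slide is not reproduced in the transcript. As a minimal sketch, assuming a logistic model (the deck does mention logistic regression among the compared methods, but the exact form used here is an assumption), the user profile θ and a document vector x can be combined as P(relevant | x, θ) = σ(θ · x). The function names and the 0.5 threshold are hypothetical.

```python
import math


def relevance_probability(theta, x):
    """P(relevant | x, theta) under an assumed logistic model."""
    score = sum(t * v for t, v in zip(theta, x))  # dot product theta . x
    return 1.0 / (1.0 + math.exp(-score))


def is_relevant(theta, x, threshold=0.5):
    """Binary document label: deliver the document if the probability is high enough."""
    return relevance_probability(theta, x) >= threshold


# Hypothetical two-feature profile: feature 0 signals relevance, feature 1 does not.
theta = [2.0, -1.0]
deliver_first = is_relevant(theta, [1.0, 0.0])
deliver_second = is_relevant(theta, [0.0, 1.0])
```

Learning the profile then amounts to choosing θ, which is what the rest of the deck addresses.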

Q2: Possible Methods
Simple methods
  – Boolean strategy (AND, OR)
  – Feature selection
  – Pseudo relevant document
Sophisticated methods
  – Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
  – Generalized Expectation Criteria (Druck et al. 08)

Our Approach
The assumption
  – A user selects a feature because it has a high correlation with the document label (R/NR)
Generalized Constraint Model (GCM)

Correlation Decomposition
Sufficiency
  – The probability of a document being relevant given that the feature has occurred: P(R+ | f=1)
  – P(R+ | f=1) = 1: sufficient features
  – E.g., Company: Facebook for social network analysts
Necessity
  – The probability of the feature having occurred given that a document is relevant: P(f=1 | R+)
  – P(f=1 | R+) = 1: necessary features
  – E.g., Language: English for English-only readers

Examples: Highly-Correlated Features
[Diagram: the whole corpus, with the relevant set R+ and the document sets covered by f1=1, f2=1, and f3=1]
1) f1 is a sufficient feature since P(R+ | f1=1) = 1
2) f2 is a necessary feature since P(f2=1 | R+) = 1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)
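
To make the two quantities concrete, here is a small sketch that computes them by counting over a toy set of labeled documents. This is plain empirical estimation of P(R+ | f=1) and P(f=1 | R+), not the model-based estimation the deck goes on to describe; the function name and the toy data are hypothetical.

```python
def sufficiency_and_necessity(docs, feature):
    """Empirical P(R+ | f=1) and P(f=1 | R+) from labeled documents.

    docs: list of (features, relevant) pairs, where `features` is a set of
    strings and `relevant` is a bool.
    """
    covered = [rel for feats, rel in docs if feature in feats]
    relevant = [feats for feats, rel in docs if rel]
    # Sufficiency: share of documents containing the feature that are relevant.
    sufficiency = sum(covered) / len(covered) if covered else 0.0
    # Necessity: share of relevant documents that contain the feature.
    necessity = (
        sum(feature in feats for feats in relevant) / len(relevant)
        if relevant else 0.0
    )
    return sufficiency, necessity


docs = [
    ({"Language: English", "Company: Facebook"}, True),
    ({"Language: English"}, True),
    ({"Language: English"}, False),
    ({"Language: French"}, False),
]
facebook = sufficiency_and_necessity(docs, "Company: Facebook")
english = sufficiency_and_necessity(docs, "Language: English")
```

In this toy data "Company: Facebook" is fully sufficient but only partly necessary, while "Language: English" is fully necessary but only partly sufficient.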

Estimating Sufficiency
[Equation annotations: document label; the feature; the set of documents covered by feature f; user profile vector; estimation of the label of document d_i]

Estimating Necessity
[Equation annotations: feature sufficiency; prior distribution]
Bayes' Theorem!
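
The two estimation formulas are not reproduced in the transcript, only their annotations. One plausible reading, offered here purely as an assumption, is that sufficiency is estimated by averaging the model's predicted relevance over the documents covered by feature f, and that necessity follows from Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) · P(f=1) / P(R+). The logistic form of the predictor, the function names, and the clamping to 1 are all choices of this sketch.

```python
import math


def predicted_relevance(theta, x):
    """Assumed logistic estimate of a document's label given profile theta."""
    return 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))


def estimate_sufficiency(theta, covered_docs):
    """Average predicted relevance over the documents covered by the feature."""
    probs = [predicted_relevance(theta, x) for x in covered_docs]
    return sum(probs) / len(probs)


def estimate_necessity(sufficiency, p_feature, p_relevant):
    """Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+).

    Clamped to 1.0 because independently estimated inputs can be inconsistent.
    """
    return min(1.0, sufficiency * p_feature / p_relevant)
```

Here `p_feature` would be the fraction of the corpus covered by the feature and `p_relevant` a prior estimate of how much of the stream is relevant; both are inputs the caller has to supply.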

Reference Distributions
Our assumption
  – A user selects a feature because it has a high sufficiency and/or a high necessity
Reference distributions: two Bernoulli distributions
  – The sufficiency/necessity of a user-selected feature should be close to the reference distribution
  – KL-divergence as the similarity measure
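
Since both the reference distribution and a feature's estimated sufficiency or necessity are Bernoulli parameters, the KL-divergence has a simple closed form. The sketch below computes KL(Bernoulli(p) || Bernoulli(q)); which argument plays the role of the reference distribution is not stated in the transcript, so the direction used in the example is an assumption, as is the clipping constant `eps`.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Hypothetical reference distribution with parameter 0.9: the penalty shrinks
# as a feature's estimated sufficiency approaches the reference.
reference = 0.9
penalty_far = bernoulli_kl(reference, 0.5)
penalty_near = bernoulli_kl(reference, 0.8)
```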

User Profile Learning
The unified loss function to combine two types of feedback:
[Equation terms: user-labeled documents; sufficient features; necessary features]
T_s, T_n: reference distributions
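
The loss function itself is not reproduced in the transcript; only its three labeled terms survive. The sketch below shows one way such a loss could be assembled, assuming a logistic log-loss on user-labeled documents plus KL-divergence penalties that pull the sufficiency and necessity of user-selected features toward Bernoulli reference parameters T_s and T_n. The weights `lam_s` and `lam_n`, the default reference values, and the KL direction are hypothetical, and the sufficiency and necessity estimates are passed in precomputed for brevity, whereas in a real model they would depend on θ.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _clip(p, eps=1e-12):
    return min(max(p, eps), 1.0 - eps)


def _bernoulli_kl(p, q):
    p, q = _clip(p), _clip(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def unified_loss(theta, labeled, suff_estimates, nec_estimates,
                 t_s=0.9, t_n=0.9, lam_s=1.0, lam_n=1.0):
    """Illustrative combined loss (an assumption, not the authors' formula).

    labeled: list of (x, y) with y in {0, 1};
    suff_estimates / nec_estimates: estimated sufficiency / necessity of the
    features the user marked as sufficient / necessary.
    """
    doc_loss = 0.0
    for x, y in labeled:
        p = _clip(_sigmoid(sum(t * v for t, v in zip(theta, x))))
        doc_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    suff_penalty = sum(_bernoulli_kl(t_s, s) for s in suff_estimates)
    nec_penalty = sum(_bernoulli_kl(t_n, n) for n in nec_estimates)
    return doc_loss + lam_s * suff_penalty + lam_n * nec_penalty
```

Minimizing a loss of this shape over θ would trade off fitting the labeled documents against honoring the faceted feedback, which is the role the slide assigns to the unified loss.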

User Interaction Mechanisms
Two mechanisms
  – Mechanism 1: ask users to select features they think are relevant
  – Mechanism 2: ask users to select separately the features they think are sufficient and the features they think are necessary

Outline
Introduction
Faceted Feedback
  – Facet-Value Pair Candidate Selection
  – Learning from Faceted Feedback
Experiments
  – Settings
  – Results
Summary

Data Sets
Use two data sets from the TREC filtering track
  – TREC 2000: OHSUMED (medical articles) + 63 topics (information needs)
    Metadata field: MeSH (Medical Subject Headings)
  – TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
    Metadata fields: Topic, Industry, Region
Split each topic set into two equal-size subsets
  – One for parameter tuning, the other for testing

Faceted Feedback Collection
Recruit subjects on Mechanical Turk
  – Five subjects per topic
  – The average performance is reported
For each topic, we show subjects
  – The topic description (information need)
  – A group of facet-value pair candidates

Evaluation Metrics
Precision (macro)
Recall (macro)
T11U = 2 * N_rd - N_nd
  – N_rd: the number of relevant docs delivered
  – N_nd: the number of non-relevant docs delivered
T11SU, a scaled version of T11U, with
  – MinNU = -0.5
  – MaxU: the maximum possible utility (T11U)
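
The T11SU formula is missing from the transcript. The sketch below computes T11U as given on the slide and fills in T11SU using the normalization commonly stated for the TREC 2002 filtering track, T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU) with MaxU = 2 × (total number of relevant documents). That normalization is an assumption here and should be checked against the track guidelines; the function names are hypothetical.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility in [0, 1] (assumed TREC 2002 normalization).

    n_rel_total must be positive; MaxU is the utility of delivering every
    relevant document and nothing else.
    """
    max_u = 2 * n_rel_total
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    # Utilities below MinNU are clipped, then the range is rescaled to [0, 1].
    return (max(normalized, min_nu) - min_nu) / (1.0 - min_nu)
```

Under this normalization a system that delivers nothing scores 1/3 rather than 0, so scores below 1/3 indicate a filter that does worse than staying silent.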

Outline
Introduction
Faceted Feedback
  – Facet-Value Pair Candidate Selection
  – Learning from Faceted Feedback
Experiments
  – Settings
  – Results
Summary

Results 1: w/wo Faceted Feedback (FF)
[Chart; x-axis: # relevant docs initially known]
Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

Results 2: Different Learning Algorithms
[Chart comparing our approach with existing approaches]
BOOL(A), BOOL(O): Boolean strategy
FS: feature selection based on FF
Pseudo-D/Q: pseudo relevant doc/query
Prior: logistic regression with Bayesian prior
GEC: generalized expectation criteria

Outline
Introduction
Faceted Feedback
  – Facet-Value Pair Candidate Selection
  – Learning from Faceted Feedback
Experiments
  – Settings
  – Results
Summary

Summary
Faceted feedback is useful for filtering, especially in cold-start scenarios
The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
In future work, we will evaluate our methods on data sets where faceted features are more important
  – Movie, music, product, etc.

Questions?