1
Automatic Collection “Recruiter”
Shuang Song
2
Project Goal
- Given a collection, automatically suggest other items to add to the collection
- Design a process to achieve the task
- Apply different filtering algorithms
- Evaluate the result
3
The Process
- Tokenization and frequency counting
- New items extraction
- New items filtering and ranking
[Diagram: Collection, Query Terms, External Source, Query Results, Training Sets, Filter, New Items; stages labeled 1, 2, 3]
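The tokenization and frequency-counting step could be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the function name `top_query_terms`, the regex tokenizer, and the stop-word list are all assumptions, since the slides do not describe them.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not say which words were excluded.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}

def top_query_terms(documents, k=10):
    """Return the k most frequent non-stop-word tokens across all documents."""
    counts = Counter()
    for text in documents:
        # Lower-case and keep alphabetic runs only (an assumed tokenizer).
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)

# Toy collection, for illustration only.
collection = [
    "Collaborative filtering of ratings",
    "Filtering algorithm for ratings and recommendations",
]
query_terms = top_query_terms(collection, k=2)
print(query_terms)
```

The (term, count) pairs returned here have the same shape as the query terms listed for the second experiment, which could then be submitted to an external source to extract candidate items.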
4
Filtering Algorithms
- Latent Semantic Analysis (LSA)
  - Pre-processing, no stemming
  - SVD over term-by-document matrix
  - Pseudo-document representation of new items
- Gzip Compression Algorithms
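For reference, the textbook LSA formulation of the SVD and pseudo-document steps is written out below. The slides give neither the rank k nor the term weighting, so both are left symbolic here and should be read as assumptions. A is the term-by-document matrix of the collection, d is the term vector of a new item, and d-hat is its pseudo-document representation in the k-dimensional LSA space.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

With this fold-in, a new item is placed in the same space as the rows of V_k (the collection's documents) without recomputing the SVD.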
5
Relevance Measure - LSA
[Figure: LSA feature space with the collection signature vector and a pseudo-document vector; vectors labeled V* and V]
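One plausible reading of this slide is that relevance is the cosine between the collection signature vector and the new item's pseudo-document vector. The sketch below assumes that reading, and further assumes the signature is the centroid of the collection's document vectors; neither detail is confirmed by the slides, and the coordinates are made up for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-D LSA coordinates for two collection documents.
collection_vectors = [[1.0, 0.0], [0.8, 0.2]]
signature = centroid(collection_vectors)   # assumed collection signature vector
pseudo_document = [0.9, 0.1]               # hypothetical new item
relevance = cosine(signature, pseudo_document)
print(relevance)
```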
6
Relevance Measure - gzip
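The slide text does not preserve the gzip relevance formula. One common compression-based construction, shown here purely as an assumption, scores a new item by how few extra bytes it costs to compress once the collection is already known. The names `compressed_size` and `gzip_relevance` and the sample texts are hypothetical.

```python
import gzip

def compressed_size(text):
    """Number of bytes in the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))

def gzip_relevance(collection_text, item_text):
    """Assumed measure: 1 minus (extra bytes given the collection) / (bytes alone).

    Values near 1 mean the item mostly repeats material from the collection.
    """
    base = compressed_size(collection_text)
    combined = compressed_size(collection_text + " " + item_text)
    alone = compressed_size(item_text)
    return 1.0 - (combined - base) / alone

# Toy texts, for illustration only.
collection_text = (
    "collaborative filtering systems use ratings from users to make recommendations. "
    "reputation of reviewers affects the weight of their ratings in the filtering algorithm. "
)
related = "ratings from users to make recommendations in the filtering algorithm"
unrelated = "the quick brown fox jumps over the lazy dog near a quiet river bank"

print(gzip_relevance(collection_text, related))
print(gzip_relevance(collection_text, unrelated))
```

Because gzip only finds repeated strings within a limited window, a measure of this kind rewards literal overlap rather than word association.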
7
First Experiment – Math Forum Collection
- 19 courseware items in the collection
- 10 items in the experiment set
  - First 5 from Math Forum
  - The other 5 from other collections at www.smete.org
8
First Experiment Result
9
Second Experiment – Collaborative Filtering Collection
- 12 papers in the collection
- 11 items in the experiment set
  - First 10 from Citeseer
    Query terms submitted: (information 284) (algorithm 250) (ratings 217) (filtering 159) (system 197) (query 149) (reputation 114) (reviewer 109) (collaborative 106) (recommendations 98)
  - Last one is the paper we read in class: “An Algorithm for Automated Rating of Reviewers”
10
Second Experiment Result
11
Second Experiment – User Study
- 6 people in my research lab participated in this study
  - 3 of them with IR background
  - 3 of them without IR background
- They were asked to rate the 11 items in the experiment set according to their degree of relevance to the given collection
12
Second Experiment Result – Human Rating
13
Second Experiment Result – Another View

Document ID   LSA   gzip   Group with IR background   Group without IR background
1             M     M      L                          L
2             H     H      H                          H
3             L     L      L                          M
4             H     L      H                          M
5             L     H      H                          H
6             H     M      H                          M
7             M     L      H                          H
8             H     L      H                          H
9             L     M      L                          L
10            M     H      H                          H
11            L     H      H                          M
14
Second Experiment Result – comparison of w/o SVD and w/o weightings
15
Second Experiment – Correlation with human rating
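The correlation values themselves are not preserved in the slide text. As a rough illustration of how such a comparison could be computed, the sketch below applies a Pearson correlation to the coarse L/M/H labels from the "Another View" slide. Mapping L/M/H to 1/2/3 is an assumption, and the original analysis may well have used finer-grained numeric ratings, so the result should not be read as the presenter's figure.

```python
import math

LEVEL = {"L": 1, "M": 2, "H": 3}  # assumed numeric coding of the labels

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0

# Labels for documents 1..11, read off the "Another View" table.
lsa_labels = "MHLHLHMHLML"
ir_group_labels = "LHLHHHHHLHH"

lsa = [LEVEL[c] for c in lsa_labels]
ir_group = [LEVEL[c] for c in ir_group_labels]
r = pearson(lsa, ir_group)
print(r)
```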
16
Second Experiment – precision and recall (cutoff: R_LSA > 0.5 & R_gzip > 0.2)
17
Second Experiment – precision and recall (cutoff: R_LSA > 0.4 & R_gzip > 0.17)
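Precision and recall at a relevance cutoff can be computed as sketched below. The sketch assumes each cutoff is applied to one method's scores independently, and that the human ratings define which items count as relevant; the scores and relevant set shown are hypothetical, not the experiment's data.

```python
def precision_recall(scores, relevant, cutoff):
    """Precision and recall when every item scoring above cutoff is retrieved.

    scores:   dict mapping document ID to relevance score
    relevant: set of document IDs judged relevant (e.g. by human raters)
    """
    retrieved = {doc for doc, score in scores.items() if score > cutoff}
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical scores and judgments, for illustration only.
scores = {1: 0.62, 2: 0.55, 3: 0.30, 4: 0.45}
relevant = {1, 4}

print(precision_recall(scores, relevant, cutoff=0.5))
print(precision_recall(scores, relevant, cutoff=0.4))
```

Lowering the cutoff retrieves more items, which typically raises recall and may lower precision; this is the trade-off the two cutoff settings on these slides explore.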
18
Comparison of Two Filtering Algorithms
- Gzip works well when input documents are just abstracts, while LSA works for both abstracts and full text
- LSA captures word association patterns and statistical importance, while gzip scans for repetition only
- LSA is more computationally demanding, while gzip is simple
- Effectiveness
19
To Do List And Future Work
- Accurate and trustworthy evaluation from an expert (collection owner?)
- Extract full text and abstract from Citeseer automatically