SIMULTANEOUSLY MODELING SEMANTICS AND STRUCTURE OF THREADED DISCUSSIONS: A SPARSE CODING APPROACH AND ITS APPLICATIONS
Chen LIN*, Jiang-Ming YANG+, Rui CAI+, Xin-Jing WANG+, Wei WANG*, Lei ZHANG+
* Fudan University   + Microsoft Research Asia
OUTLINE
- Motivation
- Challenges
- Model
- Applications
  - Reply reconstruction
  - Junk post detection
  - Expert finding
- Experiments
- Conclusion
THREADED DISCUSSIONS
- Mailing lists
- Chat rooms
- IMs
- Web forums
[Figure: a thread as a tree of a root post and its replies]
IMPORTANT DATA SOURCES
- Highly valuable knowledge & information
- Various topics
- Millions of users
MINING SEMANTICS & STRUCTURE
- Semantics: latent topics (avoid vocabulary mismatch)
- Structure: author reply relationships (utilize temporal post dependency)
- Applications: junk identification, expert search, measuring post quality, ...
CHALLENGES
- Semantics & structure
- Junk posts
- Post quality
SEMANTICS & STRUCTURE
- Semantics: topics
- Structure: who replies to whom
JUNK POSTS
[Figure: examples of spam, chitchat, and other junk posts]
POST QUALITY
[Figure: example of a valuable post]
MODEL
Purpose:
- Simultaneously model semantics and structure
Methodology:
- Intuitive
- Matrix-based
- Sparse coding
[Figure: a thread as a tree of a root post and its replies]
INTUITION
- Thread topics: a thread spans several topics
- Post topics: a post is related to only a few topics
- Post replies: a post is related to previous posts
- Sparse replies: a post replies to only a few posts
A THREAD HAS SEVERAL TOPICS
SEMANTIC REPRESENTATION OF THREAD
Project posts into topic space: approximate the word-by-post matrix D (V words x L posts) by a topic basis X (V words x T topics) times a topic-by-post code Θ (T topics x L posts).
Minimize ||D − XΘ||².
A POST IS RELATED TO PREVIOUS POSTS
b: approximate each post's topic code θ_l (a column of Θ) as a linear combination of the codes of earlier posts.
Minimize ||θ_l − Σ_{j<l} b_j θ_j||².
A POST IS RELATED TO A FEW TOPICS
[Figure: a post touching only a few topics, e.g. "government", "COBOL"]
SPARSE SEMANTICS OF POST
Same factorization D ≈ XΘ as before, but each post's topic code is forced to be sparse: minimize ||D − XΘ||² with an L1 penalty on the columns of Θ.
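The sparse projection of posts onto a topic basis can be sketched with a few lines of numpy. This is an illustrative toy, not the paper's solver: the matrix sizes, the penalty weight, and the projected-ISTA routine are all assumptions.

```python
import numpy as np

def sparse_code(X, d, alpha=0.1, iters=1000):
    """Solve min_theta 0.5*||d - X theta||^2 + alpha*||theta||_1, theta >= 0,
    by projected ISTA (a simple stand-in for a sparse-coding solver)."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - step * (X.T @ (X @ theta - d))   # gradient step
        theta = np.maximum(theta - step * alpha, 0.0)    # soft-threshold + project
    return theta

rng = np.random.default_rng(0)
V, T, L = 50, 5, 8             # vocabulary, topics, posts (toy sizes)
X = rng.random((V, T))          # topic basis: word-by-topic, assumed given
D = rng.random((V, L))          # data: word-by-post counts

# Code every post against the topic basis; columns of Theta are sparse.
Theta = np.column_stack([sparse_code(X, D[:, l]) for l in range(L)])
print(Theta.shape)              # -> (5, 8)
```

The L1 term is what makes each post load on only a few topics; setting `alpha=0` recovers an ordinary (dense) least-squares projection.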
A POST IS RELATED TO A FEW POSTS
Sparse b: approximate each post as a linear combination of previous posts, minimizing ||θ_l − Σ_{j<l} b_j θ_j||² with an L1 penalty on b, so only a few earlier posts receive nonzero weight.
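The sparse-b step can be sketched the same way: regress a post's topic code on the codes of all earlier posts with an L1 penalty, so only a few earlier posts receive nonzero weight. The solver and the toy thread below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sparse_code(P, y, alpha=0.05, iters=2000):
    """min_b 0.5*||y - P b||^2 + alpha*||b||_1, b >= 0 (projected ISTA)."""
    step = 1.0 / (np.linalg.norm(P, 2) ** 2)
    b = np.zeros(P.shape[1])
    for _ in range(iters):
        b = b - step * (P.T @ (P @ b - y))       # gradient step
        b = np.maximum(b - step * alpha, 0.0)    # soft-threshold + project
    return b

# Toy thread of four posts in a 4-topic space; post 3's code is an
# exact copy of post 1's, so its reply weights should peak at index 1.
Theta = np.array([[0.9, 0.1, 0.2],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.7],
                  [0.2, 0.0, 0.1]])
Theta = np.hstack([Theta, Theta[:, [1]]])        # post 3 duplicates post 1
b = sparse_code(Theta[:, :3], Theta[:, 3])       # weights over posts 0..2
print(b.argmax())                                 # -> 1
```

The nonzero entries of b are exactly the "a post replies to only a few posts" intuition made concrete: the peak of b points at the likely parent post.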
OPTIMIZE THEM TOGETHER
- Model semantics: sparse topic codes for posts
- Model structure: sparse reply weights over previous posts
Both objectives are combined into a single optimization.
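One plausible reconstruction of the joint objective sketched on these slides combines the two minimizations above; the trade-off weights λ, μ, ν and the exact form are assumptions here (the precise objective is in the paper):

```latex
\min_{X,\,\Theta,\,\{b_l\}}\;
\underbrace{\|D - X\Theta\|_F^2 \;+\; \lambda \sum_{l} \|\theta_l\|_1}_{\text{semantics: sparse topic codes}}
\;+\;
\underbrace{\mu \sum_{l} \Big\|\theta_l - \sum_{j<l} b_{l,j}\,\theta_j\Big\|_2^2 \;+\; \nu \sum_{l} \|b_l\|_1}_{\text{structure: sparse reply weights}}
```

The first pair of terms is the sparse topic factorization; the second pair ties each post's code to a sparse combination of its predecessors, coupling semantics and structure in one optimization.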
APPLICATIONS
- Reply reconstruction: capability of recognizing structure
- Junk identification: capability of capturing semantics
- Expert finding: capability of measuring post quality
REPLY RECONSTRUCTION
- Document similarity
- Topic similarity
- Structure similarity
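One way to read the three signals listed above: score each earlier post as a candidate parent by mixing document, topic, and structure similarity. Everything below (function names, the weights, the toy vectors) is an illustrative assumption, not the paper's scoring rule.

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pick_parent(doc_vecs, topic_vecs, b, l, w=(0.3, 0.3, 0.4)):
    """Pick the most likely parent of post l among posts 0..l-1 using a
    weighted mix of document, topic, and structure (reply-weight) similarity."""
    scores = [w[0] * cosine(doc_vecs[j], doc_vecs[l])    # document similarity
              + w[1] * cosine(topic_vecs[j], topic_vecs[l])  # topic similarity
              + w[2] * b[j]                              # structure similarity
              for j in range(l)]
    return int(np.argmax(scores))

# Toy thread: post 2 matches post 0 in words, topics, and reply weight.
doc_vecs   = [[1, 0], [0, 1], [1, 0]]
topic_vecs = [[1, 0], [0, 1], [1, 0]]
b = [0.9, 0.1]                 # sparse reply weights for post 2
print(pick_parent(doc_vecs, topic_vecs, b, 2))   # -> 0
```

Applying this to every post in a thread yields a parent pointer per post, i.e. a reconstructed reply tree.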
DATA SET

                    Slashdot    Apple discussion
No. threads         1,154       4,488
No. posts           203,210     80,008
Avg. thread len.    176.09      17.84
Avg. words/post     73.53       78.36
Avg. posts/user     15.32       4.69
BASELINES
- NP: reply to nearest post
- RR: reply to root
- DS: document similarity
- LDA: Latent Dirichlet Allocation (project documents into topic space)
- SWB: Special Words Topic Model with Background distribution (project documents into topic and junk-topic space)
EVALUATION

            Slashdot                 Apple
Method      All posts   Good posts   All posts   Good posts
NP          0.021       0.012        0.289       0.239
RR          0.183       0.319        0.269       0.474
DS          0.463       0.643        0.409       0.628
LDA         0.465       0.644        0.410       0.648
SWB         0.463       0.644        0.410       0.641
SMSS        0.524       0.737        0.517       0.772
JUNK IDENTIFICATION
Same factorization D ≈ XΘ, with the topic basis X augmented by a background topic (topic_bg) and Θ by a matching row; a post's weight on the background topic gives its probability of being junk.
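Under the background-topic formulation above, a minimal junk score is simply the share of a post's topic weight that falls on the background topic. The function and threshold below are illustrative assumptions:

```python
import numpy as np

def junk_score(theta_with_bg):
    """Fraction of a post's topic weight on the trailing background topic
    (the last entry of the augmented code); higher means junkier."""
    theta = np.abs(np.asarray(theta_with_bg, dtype=float))
    return float(theta[-1] / (theta.sum() + 1e-12))

# A post that loads almost entirely on the background topic looks junky.
print(junk_score([0.05, 0.05, 0.9]))   # high score -> likely junk
```

Thresholding this score (or feeding it to a classifier) turns the learned codes into a junk/non-junk decision.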
DATA SET
Slashdot, Apple discussion
BASELINES
- DF
- SVM: classify posts as junk vs. non-junk
- SWB: Special Words Topic Model with Background distribution (project documents into topic and junk-topic space)
EVALUATION

Method   Precision   Recall   F-measure
SWB      0.48        0.22     0.30
SVM      0.37        0.24     0.20
DF       0.34        0.40     0.36
SMSS     0.38        0.45     0.41
EXPERT FINDING
Pipeline: reply reconstruction → network construction → expert finding.
Ranking methods: HITS, PageRank, ...
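The last step of the pipeline can be sketched with a hand-rolled PageRank over the user reply graph. The edge direction (asker → answerer, so frequent answerers accumulate score), the toy graph, and the damping factor are illustrative assumptions:

```python
import numpy as np

users = ["u1", "u2", "u3", "expert"]
idx = {u: i for i, u in enumerate(users)}
# Edge u -> v: u's post was answered by v (direction is a modeling choice).
edges = [("u1", "expert"), ("u2", "expert"), ("u3", "expert"), ("u3", "u1")]

n = len(users)
A = np.zeros((n, n))
for u, v in edges:
    A[idx[u], idx[v]] = 1.0

# Row-normalize into a transition matrix; dangling nodes jump uniformly.
out = A.sum(axis=1, keepdims=True)
P = np.where(out > 0, A / np.where(out == 0, 1, out), 1.0 / n)

d = 0.85                        # damping factor
r = np.full(n, 1.0 / n)
for _ in range(100):            # power iteration
    r = (1 - d) / n + d * (P.T @ r)

ranking = sorted(users, key=lambda u: -r[idx[u]])
print(ranking[0])               # -> expert
```

Swapping the power iteration for HITS would instead yield hub and authority scores over the same reconstructed network, matching the slide's list of ranking methods.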
BASELINES
- LM: Formal Models for Expert Finding in Enterprise Corpora (SIGIR '06); achieves stable performance on the expert finding task using a language model
- PageRank: benchmark nodal ranking method
- HITS: finds hub nodes and authority nodes
- EABIF: Personalized Recommendation Driven by Information Flow (SIGIR '06); finds the most influential nodes
EVALUATION (Bayesian estimate)

Method            MRR     MAP     P@10
LM                0.821   0.698   0.800
EABIF (ori.)      0.674   0.362   0.243
EABIF (rec.)      0.742   0.318   0.281
PageRank (ori.)   0.675   0.377   0.263
PageRank (rec.)   0.743   0.321   0.266
HITS (ori.)       0.906   0.832   0.900
HITS (rec.)       0.938   0.822   0.906

(ori. = on the original reply network; rec. = on the reconstructed network)
DISCUSSION
Parameters vs. model complexity (linear regression vs. the SMSS model): though the number of parameters increases, the projection space is shrunk by prior knowledge.
CONCLUSION
Purpose:
- Mine the semantics and the structure of threaded discussions
Highlight:
- Simultaneously model semantics and structure
Applications designed to evaluate the model:
- Reply reconstruction
- Junk identification
- Expert finding
PERFORMANCE: PARAMETER
PERFORMANCE: PARAMETERS
OPTIMIZATION
OPTIMIZATION