SIMULTANEOUSLY MODELING SEMANTICS AND STRUCTURE OF THREADED DISCUSSIONS: A SPARSE CODING APPROACH AND ITS APPLICATIONS Chen LIN *, Jiang-Ming YANG +, Rui.

1 SIMULTANEOUSLY MODELING SEMANTICS AND STRUCTURE OF THREADED DISCUSSIONS: A SPARSE CODING APPROACH AND ITS APPLICATIONS
  Chen LIN *, Jiang-Ming YANG +, Rui CAI +, Xin-jing WANG +, Wei WANG *, Lei ZHANG +
  * Fudan University
  + Microsoft Research Asia

2 OUTLINE
  Motivation
  Challenges
  Model
  Application
    Reply reconstruction
    Junk post detection
    Expert finding
  Experiments
  Conclusion

3 THREADED DISCUSSIONS
  Mailing lists
  Chat rooms
  IMs
  Web forums
  Figure labels: root, reply

4 IMPORTANT DATA SOURCES
  Highly valuable knowledge & info.
  Various topics
  Millions of users

5 MINING SEMANTICS & STRUCTURE
  Semantics
    Latent topics
    Avoid vocabulary mismatch
  Structure
    Author reply relationship
    Utilize temporal post dependency
  Junk identification
  Expert search
  Measure post quality
  …

6 CHALLENGE
  Semantics & Structure
  Junk Post
  Post Quality

7 SEMANTICS & STRUCTURE
  Semantics: topics
  Structure: who replies to whom

8 CHALLENGE
  Semantics & Structure
  Junk Post
  Post Quality

9 JUNK POST
  Figure labels: spam, chitchat, junk

10 CHALLENGE
  Semantics & Structure
  Junk Post
  Post Quality

11 POST QUALITY
  Figure label: valuable post

12 MODEL
  Purpose: simultaneously model
    Semantics
    Structure
  Methodology
    Intuitive
    Matrix based
    Sparse coding
  Figure labels: root, reply

13 INTUITION
  Thread topic
  Post topic
  Post reply
  Sparse reply

14 A THREAD HAS SEVERAL TOPICS

15 SEMANTIC REPRESENTATION OF THREAD
  Project posts to topic space
  D: words (word1 … wordV) × posts (post1 … postL)
  X: words (word1 … wordV) × topics (topic1 … topicT)
  Θ: topics (topic1 … topicT) × posts (post1 … postL)
  Minimize: [formula image not captured in the transcript]
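The objective on this slide was an image and is missing from the transcript. A plausible reading of "project posts to topic space", assuming a squared Frobenius reconstruction loss (an assumption, not something the slide text confirms), is:

```latex
\min_{X,\,\Theta}\; \lVert D - X\Theta \rVert_F^2,
\qquad
D \in \mathbb{R}^{V \times L},\quad
X \in \mathbb{R}^{V \times T},\quad
\Theta \in \mathbb{R}^{T \times L}
```

Under this reading, column $\theta_i$ of $\Theta$ is the topic-space representation of post $i$, and each column of $X$ describes one topic as a weighting over the $V$ vocabulary words.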

16 A POST IS RELATED TO PREVIOUS POSTS
  Θ: topics (topic1 … topicT) × posts (post1 … postL)
  b: approximate each post as a linear combination of previous posts
  Minimize: [formula image not captured in the transcript]

17 A POST IS RELATED TO A FEW TOPICS
  Figure labels: government, cobol

18 SPARSE SEMANTICS OF POST
  D: words (word1 … wordV) × posts (post1 … postL)
  X: words (word1 … wordV) × topics (topic1 … topicT)
  Θ: topics (topic1 … topicT) × posts (post1 … postL)
  Minimize: [formula image not captured in the transcript]

19 A POST IS RELATED TO A FEW POSTS
  Θ: topics (topic1 … topicT) × posts (post1 … postL)
  Sparse b: approximate each post as a linear combination of previous posts
  Minimize: [formula image not captured in the transcript]
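To illustrate how a sparse b can arise, here is a minimal sketch that fits one post's topic vector as a combination of earlier posts under an L1 penalty. It assumes a squared-error loss and uses ISTA (iterative soft-thresholding), a generic solver; the slides do not say which loss or optimizer the authors used. The function names and the `lam` parameter are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_reply_weights(target, previous, lam=0.1, steps=500):
    """Minimise ||target - sum_k b[k] * previous[k]||^2 + lam * ||b||_1.

    target   -- topic vector of the current post (list of floats)
    previous -- topic vectors of the earlier posts (list of lists)
    Returns the weight vector b, one weight per earlier post.
    """
    n = len(previous)
    dim = len(target)
    # 2 * (squared Frobenius norm) upper-bounds the gradient's Lipschitz
    # constant, so 1 / that bound is a safe step size.
    frob_sq = sum(x * x for vec in previous for x in vec)
    if n == 0 or frob_sq == 0.0:
        return [0.0] * n
    step = 1.0 / (2.0 * frob_sq)
    b = [0.0] * n
    for _ in range(steps):
        # Residual of the current linear combination.
        residual = [
            sum(b[k] * previous[k][d] for k in range(n)) - target[d]
            for d in range(dim)
        ]
        # Gradient of the squared-error term with respect to each b[k].
        grad = [
            2.0 * sum(residual[d] * previous[k][d] for d in range(dim))
            for k in range(n)
        ]
        # Gradient step followed by soft-thresholding (the L1 part).
        b = [soft_threshold(b[k] - step * grad[k], step * lam) for k in range(n)]
    return b


if __name__ == "__main__":
    earlier = [[1.0, 0.0], [0.0, 1.0]]
    current = [0.8, 0.02]
    print(sparse_reply_weights(current, earlier))
```

In the toy call, the weight on the weakly related earlier post is driven exactly to zero, which is the behaviour the slide title describes: a post ends up related to only a few previous posts.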

20 OPTIMIZE THEM TOGETHER
  Model semantic
  Model structure
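The combined objective is not preserved in the transcript. One way the four ingredients named on the preceding slides (reconstruction, sparse post topics, dependence on previous posts, sparse reply weights) could be summed is sketched below. The squared-error losses, the L1 penalties, and the trade-off weights $\lambda_1, \lambda_2, \lambda_3$ are all assumptions made for illustration, not the authors' stated formula.

```latex
\min_{X,\,\Theta,\,b}\;
\underbrace{\lVert D - X\Theta \rVert_F^2
  + \lambda_1 \sum_{i=1}^{L} \lVert \theta_i \rVert_1}_{\text{model semantics}}
\;+\;
\underbrace{\lambda_2 \sum_{i=2}^{L}
  \Bigl\lVert \theta_i - \sum_{k<i} b^{(i)}_k\, \theta_k \Bigr\rVert_2^2
  + \lambda_3 \sum_{i=2}^{L} \lVert b^{(i)} \rVert_1}_{\text{model structure}}
```

Here $D$ is the word-by-post matrix, $X$ the word-by-topic matrix, $\theta_i$ the column of $\Theta$ for post $i$, and $b^{(i)}$ the weights that express post $i$ in terms of the posts before it.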

21 APPLICATIONS
  Reply reconstruction
    Capability of recognizing structure
  Junk identification
    Capability of capturing semantics
  Expert finding
    Capability of measuring post quality

22 REPLY RECONSTRUCTION
  Document Similarity
  Topic Similarity
  Structure Similarity

23 DATA SET
                     Slashdot    Apple discussion
  No. threads        1154        4488
  No. posts          203210      80008
  Avg. thread len.   176.09      17.84
  Avg. word/p        73.53       78.36
  Avg. post/user     15.32       4.69

24 BASELINES
  NP: Reply to Nearest Post
  RR: Reply to Root
  DS: Document Similarity
  LDA: Latent Dirichlet Allocation
    Project documents to topic space
  SWB: Special Words Topic Model with Background distribution
    Project documents to topic and junk topic space
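The three simplest baselines can be sketched in a few lines. This is an illustrative reading of the slide, assuming that posts are indexed in time order with the root at index 0 and that DS picks the earlier post with the highest bag-of-words cosine similarity; the tokenisation and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reply_to_nearest(i):
    """NP: every post replies to the post immediately before it."""
    return i - 1


def reply_to_root(i):
    """RR: every post replies to the root post (index 0)."""
    return 0


def reply_by_document_similarity(posts, i):
    """DS: post i replies to the most similar earlier post."""
    bags = [Counter(text.lower().split()) for text in posts]
    return max(range(i), key=lambda j: cosine(bags[i], bags[j]))


def accuracy(predicted, actual):
    """Fraction of posts whose predicted parent matches the true parent."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual) if actual else 0.0


if __name__ == "__main__":
    thread = [
        "cobol jobs in government",
        "I like pizza",
        "government still runs cobol",
    ]
    print(reply_to_nearest(2), reply_to_root(2),
          reply_by_document_similarity(thread, 2))
```

The LDA and SWB baselines would follow the same pattern as DS but compare posts in a topic space rather than in word space.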

25 EVALUATION
            Slashdot                  Apple
  Method    All Posts   Good Posts    All Posts   Good Posts
  NP        0.021       0.012         0.289       0.239
  RR        0.183       0.319         0.269       0.474
  DS        0.463       0.643         0.409       0.628
  LDA       0.465       0.644         0.410       0.648
  SWB       0.463       0.644         0.410       0.641
  SMSS      0.524       0.737         0.517       0.772

26 JUNK IDENTIFICATION
  D: words (word1 … wordV) × posts (post1 … postL)
  X: words (word1 … wordV) × topics (topic1 … topicT, topicbg)
  Θ: topics (topic1 … topicT, topicbg) × posts (post1 … postL)
  Figure annotation: Probability of junk
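The slide adds a background topic (topicbg) and labels something in the figure "Probability of junk". One possible reading, sketched below, is that a post's junk score is the share of its topic weight that falls on the background topic. The normalisation, the function name, and the convention that the background topic is the last row of Θ are assumptions, not details given in the transcript.

```python
def junk_scores(theta, background_row=-1):
    """Share of each post's total topic weight on the background topic.

    theta[t][i] is the weight of topic t in post i; the background topic
    is assumed (hypothetically) to be the last row.
    """
    n_posts = len(theta[0]) if theta else 0
    scores = []
    for i in range(n_posts):
        total = sum(abs(row[i]) for row in theta)
        background = abs(theta[background_row][i])
        scores.append(background / total if total else 0.0)
    return scores


if __name__ == "__main__":
    theta = [
        [0.9, 0.0],  # topic1
        [0.0, 0.1],  # topic2
        [0.1, 0.9],  # topicbg
    ]
    print(junk_scores(theta))
```

A threshold on this score would then separate junk from non-junk posts; where to put the threshold is not specified in the slides.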

27 DATA SET
  Slashdot
  Apple discussion

28 BASELINES
  DF
  SVM
    Classify posts as junk posts & non-junk posts
  SWB: Special Words Topic Model with Background distribution
    Project documents to topic and junk topic space

29 EVALUATION
  Method    Precision   Recall   F-measure
  SWB       0.48        0.22     0.30
  SVM       0.37        0.24     0.20
  DF        0.34        0.40     0.36
  SMSS      0.38        0.45     0.41

30 EXPERT FINDING
  Pipeline: Reply reconstruction, then Network construction, then Expert finding
  Methods
    HITS
    PageRank
    …
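The pipeline on this slide can be illustrated with a small HITS implementation over an author reply network. This is a sketch under stated assumptions: an edge (u, v) means author u replied to author v, and authority is taken to accrue to authors who attract replies. The slides do not state the edge direction or how scores map to expertise, and the function names are hypothetical.

```python
import math


def _normalise(scores):
    """Scale scores to unit L2 norm; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return scores
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Run HITS on a directed graph given as (source, target) pairs.

    Returns (hub, authority) dictionaries keyed by node.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing at this node.
        auth = _normalise(
            {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        )
        # Hub: sum of authority scores of the nodes this node points at.
        hub = _normalise(
            {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        )
    return hub, auth


if __name__ == "__main__":
    # Hypothetical reply edges recovered by reply reconstruction.
    reply_edges = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
    hub_scores, auth_scores = hits(reply_edges)
    print(sorted(auth_scores, key=auth_scores.get, reverse=True))
```

The quality of the ranking depends on the reply edges fed in, which is why the slide places reply reconstruction before network construction.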

31 BASELINES
  LM: "Formal Models for Expert Finding in Enterprise Corpora", SIGIR 06
    Achieves stable performance in the expert finding task using a language model
  PageRank
    Benchmark nodal ranking method
  HITS
    Finds hub nodes and authority nodes
  EABIF: "Personalized Recommendation Driven by Information Flow", SIGIR 06
    Finds the most influential node

32 EVALUATION
  Bayesian estimate
  Method            MRR     MAP     P@10
  LM                0.821   0.698   0.800
  EABIF (ori.)      0.674   0.362   0.243
  EABIF (rec.)      0.742   0.318   0.281
  PageRank (ori.)   0.675   0.377   0.263
  PageRank (rec.)   0.743   0.321   0.266
  HITS (ori.)       0.906   0.832   0.900
  HITS (rec.)       0.938   0.822   0.906
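The slide does not define the three metrics. The sketch below uses their standard information-retrieval definitions, which is an assumption about how the authors computed them; the function names and the toy ranking are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none appears."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values taken at each relevant item's position."""
    found = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            found += 1
            total += found / position
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d"]
    experts = {"b", "d"}
    print(reciprocal_rank(ranking, experts),
          average_precision(ranking, experts),
          precision_at_k(ranking, experts, k=2))
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries, which `mean_over_queries` computes.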

33 DISCUSSION
  Parameters vs. model complexity
    Linear regression
    SMSS model
  Although the number of parameters increases, the projection space is shrunk by the prior knowledge.
  Figure label: Prior knowledge

34 CONCLUSION
  Purpose
    Mine the semantics
    Mine the structure
  Highlight
    Simultaneously model the semantics and the structure
    Applications are designed to evaluate the model
      Reply reconstruction
      Junk identification
      Expert finding

35 PERFORMANCE: PARAMETER

36 PERFORMANCE: PARAMETERS

37 OPTIMIZATION

38 OPTIMIZATION

