Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW.

1 Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW 2006

2 Outline
Introduction
Related Work
Community-User-Topic Models
Semantic Community Discovery
Experiments
Conclusion

4 Social Network Analysis (SNA)
SNA is an established field in sociology
The goal of SNA
–Discovering interpersonal relationships based on various modes of information carriers, such as emails and the Web
The community graph structure
–How social actors gather into groups that are tightly connected internally and loosely connected to each other
–An important characteristic of all SNs

5 Discovering Community from Email Corpora
Typically the SN is constructed by measuring the intensity of contacts between email users.
–An edge indicates that communication between two users exceeds a certain frequency threshold
–Problematic in some scenarios
  A spammer in the email system sends out a lot of messages
  The lack of semantic interpretation
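
The frequency-threshold construction described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_contact_graph`, the (sender, recipient) input format, the strict ">" comparison, and the toy message log are all assumptions.

```python
from collections import Counter

def build_contact_graph(messages, threshold):
    """Return the undirected edges whose message count exceeds `threshold`.

    `messages` is an iterable of (sender, recipient) pairs. The input
    format and the strict ">" comparison are illustrative assumptions.
    """
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # Direction is ignored: a frozenset treats (a, b) and (b, a) alike.
            counts[frozenset((sender, recipient))] += 1
    return {pair for pair, n in counts.items() if n > threshold}

# Hypothetical toy log: two colleagues exchange mail, and one spammer
# repeatedly contacts everyone.
log = [("alice", "bob")] * 3 + [("bob", "alice")] * 2
log += [("spammer", name) for name in ("alice", "bob", "carol")] * 6

edges = build_contact_graph(log, threshold=4)
spammer_degree = sum(1 for edge in edges if "spammer" in edge)
```

In this toy log the spammer ends up connected to every other user, which illustrates why a purely frequency-based graph can be misleading.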

6 Proposed Method
The inner community property within SNs is examined by analyzing semantic information such as emails
A generative Bayesian network is used to model the generation of communication in an SN
Similarity among social actors is modeled as a hidden layer in the proposed probabilistic model

8 Related Work: Document Content Characterization
Several factors, either observable or latent, are modeled as variables in the generative Bayesian network
Topic-Word model
–Each document is considered a mixture of topics
–Each topic corresponds to a multinomial distribution over words
–Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]

9 Related Work (2)
Author-Word model
–The author x is chosen randomly from a_d, the authors of document d [A. McCallum, 1999]
Author-Topic model
–Involves both the author and the topic
–Performs well for document content characterization [M. Steyvers et al., 2004]

11 Community-User-Topic Models (CUT)
Communication document
–A document that carries communication, such as an email
Basic idea
–The issuing of a communication document indicates the activities of, and is also conditioned on, the community structure within an SN
–The community is considered as an extra latent variable in the Bayesian network, in addition to the author and topic variables

12 CUT 1: Modeling Community with Users (1)
Assume an SN community is no more than a group of users
–Similar to that assumed in a topology-based method
–Treat each community as a multinomial distribution over users

13 CUT 1: Modeling Community with Users (2)
Compute the posterior probability P(c, u, z|w) by computing the joint probability P(c, u, z, w) and normalizing
A possible side-effect of CUT 1 is that it relaxes the community’s impact on the generated topics
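
The model diagram for CUT 1 is not part of this transcript. The sketch below therefore assumes a chain structure (community, then user, then topic, then word), which would be consistent with the community influencing topics only through users. The function names and all probability tables are hypothetical toy values, not parameters from the paper.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_word(p_c, p_u_given_c, p_z_given_u, p_w_given_z, rng):
    """Ancestral sampling under the ASSUMED chain community -> user -> topic -> word."""
    c = sample_categorical(p_c, rng)             # pick a community
    u = sample_categorical(p_u_given_c[c], rng)  # community as a distribution over users
    z = sample_categorical(p_z_given_u[u], rng)  # user as a distribution over topics
    w = sample_categorical(p_w_given_z[z], rng)  # topic as a distribution over words
    return c, u, z, w

# Hypothetical toy parameters: 2 communities, 3 users, 2 topics, 4 words.
p_c = [0.5, 0.5]
p_u_given_c = [[0.7, 0.3, 0.0], [0.0, 0.2, 0.8]]
p_z_given_u = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
p_w_given_z = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]

rng = random.Random(0)
samples = [
    generate_word(p_c, p_u_given_c, p_z_given_u, p_w_given_z, rng)
    for _ in range(1000)
]
```

Under this assumed structure the topic depends on the community only through the chosen user, which is one way to read the side-effect noted on the slide.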

14 CUT 2: Modeling Community with Topics (1)
An SN community consists of a set of topics
CUT 2 differs from CUT 1 in strengthening the relation between community and topic

15 CUT 2: Modeling Community with Topics (2)
Similarly, compute P(c, u, z|w) by computing the joint probability P(c, u, z, w) and normalizing
A possible side-effect of CUT 2 is that it might lead to loose ties between community and users
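
As a rough illustration of obtaining the posterior from the joint, the sketch below normalizes P(c, u, z, w) over all (c, u, z) triples for one observed word. The factorization P(c) P(z|c) P(u|z) P(w|z) is an assumption made here for CUT 2, since the model diagram is missing from the transcript. The function name and the toy tables are hypothetical.

```python
from itertools import product

def posterior_given_word(w, p_c, p_z_given_c, p_u_given_z, p_w_given_z):
    """Return {(c, u, z): P(c, u, z | w)} by normalizing the joint over all triples.

    ASSUMED joint: P(c, u, z, w) = P(c) * P(z|c) * P(u|z) * P(w|z).
    """
    joint = {}
    for c, z in product(range(len(p_c)), range(len(p_w_given_z))):
        for u in range(len(p_u_given_z[z])):
            joint[(c, u, z)] = (
                p_c[c] * p_z_given_c[c][z] * p_u_given_z[z][u] * p_w_given_z[z][w]
            )
    total = sum(joint.values())  # this sum is the marginal P(w)
    return {key: value / total for key, value in joint.items()}

# Hypothetical toy parameters: 2 communities, 2 topics, 2 users, 2 words.
p_c = [0.5, 0.5]
p_z_given_c = [[0.9, 0.1], [0.2, 0.8]]
p_u_given_z = [[0.6, 0.4], [0.3, 0.7]]
p_w_given_z = [[0.8, 0.2], [0.1, 0.9]]

post = posterior_given_word(0, p_c, p_z_given_c, p_u_given_z, p_w_given_z)
best = max(post, key=post.get)  # most probable (community, user, topic) for word 0
```

Exhaustive normalization like this is only feasible for tiny state spaces, which is why a sampling method is needed for real corpora.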

17 Practical Algorithm: Gibbs Sampling
Gibbs sampling is an algorithm that approximates the joint distribution of multiple variables by drawing a sequence of samples
Gibbs sampling is a Markov chain Monte Carlo algorithm, typically applicable when the conditional probability distribution of each variable can be evaluated
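
To make the idea concrete, here is a generic Gibbs sampler for a standard bivariate normal with correlation rho, a textbook case where both conditionals are known in closed form. It is not the CUT sampler. The function name and parameter defaults are illustrative choices.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) from a standard bivariate normal with correlation `rho`.

    Each step draws one variable from its conditional given the other:
    x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2).
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if step >= burn_in:  # discard early draws taken before the chain settles
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
# For zero-mean, unit-variance variables, E[x * y] equals the correlation.
mean_xy = sum(x * y for x, y in samples) / len(samples)
```

The sampler never evaluates the joint density directly. It only needs the two conditionals, which is the property the slide highlights.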

18 Gibbs Sampling for CUT

19 Estimation of the Conditional Probability
Estimating P(c_i, u_i, z_i | w_i) for CUT 1 and CUT 2
CUT 1:
CUT 2:

20 EnF-Gibbs: Gibbs Sampling with Entropy Filtering
Non-informative words are ignored after A iterations
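
The transcript does not give the exact filtering criterion. One plausible reading, sketched below purely as an assumption, is that a word is non-informative when its assignments are spread almost evenly across topics, that is, when the entropy of its topic distribution is high. The function names, the threshold, and the toy counts are hypothetical.

```python
import math

def topic_entropy(topic_counts):
    """Shannon entropy (in nats) of a word's normalized topic-assignment counts."""
    total = sum(topic_counts)
    if total == 0:
        return 0.0  # a word with no assignments is treated as zero entropy here
    return -sum((n / total) * math.log(n / total) for n in topic_counts if n > 0)

def informative_words(word_topic_counts, max_entropy):
    """Keep words whose topic distribution has entropy at most `max_entropy`."""
    return {
        word
        for word, counts in word_topic_counts.items()
        if topic_entropy(counts) <= max_entropy
    }

# Hypothetical counts over 4 topics, as might be observed after some iterations.
counts = {
    "pipeline": [40, 1, 0, 2],
    "the": [25, 24, 26, 25],
    "trading": [0, 3, 30, 1],
}
# ASSUMED threshold: 90% of the maximum possible entropy, log(4).
kept = informative_words(counts, max_entropy=0.9 * math.log(4))
```

In a sampler, a filter of this kind would be applied once after the first A iterations, and the remaining iterations would skip the discarded words.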

22 Experiment Setup
Data: Enron email dataset
–Made public by the Federal Energy Regulatory Commission
Fix the number of communities C at 6 and the number of topics T at 20
The smoothing hyper-parameters α, β and γ were set at 5/T, 0.01 and 0.1 respectively

23 Experiment Result-1
Table 1: Topics discovered by CUT 1
Table 2: Abbreviations

24 Experiment Result-2
Fig: Communities/topics of an employee

25 Experiment Result-3
Fig: A community discovered by CUT 2

26 Experiment Result-4
D..steffes = vice president of Enron in charge of government affairs
Cara.semperger = a senior analyst
Mike.grigsby = a marketing manager
Rick.buy = chief risk management officer

27 Experiment Result-5
Similarity between two clustering results:
Fig: Community similarity comparisons

28 Experiment Result-6
Fig: Efficiency of EnF-Gibbs

30 Conclusion and Future Work
Two versions of Community-User-Topic models are presented for community discovery in SNs
EnF-Gibbs sampling is introduced by extending Gibbs sampling with entropy filtering
Experiments show that the proposed method effectively tags communities with topic semantics
It would be interesting to explore the predictive performance of these models on new communications between social actors who are strangers to each other in SNs

31 Illustration of Dirichlet Distribution Several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α=(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
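
Since the density images are not included in the transcript, a small sampling sketch can stand in for them. It draws points from a Dirichlet distribution by normalizing independent Gamma draws, a standard construction, and compares the empirical mean with the known mean α_k / Σα. The function name, sample size, and seed are illustrative choices.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
alpha = (6, 2, 2)  # first parameter vector listed on the slide
points = [sample_dirichlet(alpha, rng) for _ in range(5000)]

# The mean of a Dirichlet is alpha_k / sum(alpha); here that is 0.6, 0.2, 0.2.
expected_mean = [a / sum(alpha) for a in alpha]
empirical_mean = [sum(p[k] for p in points) / len(points) for k in range(3)]
```

Larger α values concentrate the samples more tightly around the mean, while the ratios between the components set where on the simplex that mean lies.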

