1 1. Social-Network Analysis Using Topic Models 2. Web Event Topic Analysis by Topic Feature Clustering and Extended LDA Model RMBI4310/COMP4332 Big Data Mining: Presentation CHAN Chung Yin 07547265

2 Topic Modeling Statistical methods that analyze the words of the original texts to – discover the themes that run through them – discover how those themes are connected to each other – discover how they change over time

3 Latent Dirichlet Allocation (LDA) The simplest and most common topic model Idea: – Represents documents as mixtures of topics – Given a collection of words, finds a set of topics that are likely to have generated the collection Example sentences: – I like to eat broccoli and bananas. – I ate a banana and spinach smoothie for breakfast. – Chinchillas and kittens are cute. – My sister adopted a kitten yesterday. – Look at this cute hamster munching on a piece of broccoli. Sentences 1 and 2: 100% Topic A Sentences 3 and 4: 100% Topic B Sentence 5: 60% Topic A, 40% Topic B → Topic A (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … → Topic B (cute animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
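A minimal sketch of fitting LDA to the five example sentences above, assuming the gensim library is available; the recovered mixtures vary run to run since inference is stochastic, so the exact percentages on the slide are illustrative.

```python
# Minimal LDA fit over the five example sentences (assumes gensim is installed).
from gensim import corpora, models

texts = [
    ["broccoli", "bananas", "eat"],
    ["banana", "spinach", "smoothie", "breakfast"],
    ["chinchillas", "kittens", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "munching", "broccoli"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50)
for doc in corpus:
    print(lda.get_document_topics(doc))  # per-document topic mixture
print(lda.show_topics())                 # per-topic word distribution
```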

4 Latent Dirichlet Allocation (LDA) Learning: – Go through each document d and randomly assign each word w in the document to one of the K topics t – Go through each word w in each document d and calculate, for each topic t, – p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t – p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w – Reassign w to a new topic, choosing topic t with probability p(topic t | document d) * p(word w | topic t); this assumes that all topic assignments except the one for the current word are correct – Repeat the above step until a steady state is reached
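A compact sketch of the collapsed Gibbs sampling loop described above, using only numpy; the smoothing constants alpha and beta and the iteration count are assumed values, not taken from the slides.

```python
import numpy as np

def lda_gibbs(docs, V, K=10, alpha=0.1, beta=0.01, iters=200):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), K))  # words per (document, topic)
    nkw = np.zeros((K, V))          # assignments per (topic, word)
    nk = np.zeros(K)                # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]  # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                       # remove the current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(topic t | document d) * p(word w | topic t), smoothed
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())  # sample the new topic
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw
```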

5 Applying LDA to Social Network Analysis Background: – The "follow relationship" among users often looks unorganized and chaotic – Follow relationships are created haphazardly by each individual user and are not controlled by a central entity Purpose: provide more structure to this follow relationship – by "grouping" the users based on their topic interests – by "labeling" each follow relationship with the identified topic group

6 Applying LDA to Social Network Analysis Represent E in matrix form by putting F in the rows and G in the columns. Notation:
– u: a Twitter user; U: the set of all u
– f: a follower; F: the set of all f
– g: a followed user; G: the set of all g
– z: a topic (interest); Z: the set of all z
– e(f, g): a follow edge from f to g; e'(f, g): a follow edge from g to f
– e(f): the set of all outgoing edges from f; e'(g): the set of all incoming edges to g
– E: the set of all e(f, g), ∀ f ∈ F, ∀ g ∈ G
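To make the matrix form concrete, a small sketch that builds the binary follower-by-followed matrix E from a list of (f, g) edges; the toy edge list is made up for illustration.

```python
import numpy as np

# Hypothetical toy edge list: (follower, followed) pairs.
edges = [("alice", "bbc"), ("alice", "nasa"), ("bob", "nasa")]
F = sorted({f for f, _ in edges})           # rows: followers
G = sorted({g for _, g in edges})           # columns: followed users
E = np.zeros((len(F), len(G)), dtype=int)   # binary, unlike LDA's integer counts
for f, g in edges:
    E[F.index(f), G.index(g)] = 1
print(E)
```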

7 Applying LDA to Social Network Analysis Similarities between LDA and the edge generative model:
– Topic ↔ Interest (z)
– Document ↔ Follower (f)
– Word ↔ Followed user (g)

8 Applying LDA to Social Network Analysis Differences (LDA vs. edge generative model):
– Multinomial distribution vs. multivariate hypergeometric distribution
– Documents in the rows and words in the columns vs. rows and columns from the same entity
– Integer entries vs. binary entries
– Smaller magnitude vs. larger magnitude (|F| × |G|)
– Power-law distribution vs. no power-law distribution

9 Handling Popular Users Alternative 1: Setting Asymmetric Priors – The Dirichlet priors α and β constrain P(z|f) and P(g|z), respectively, to avoid over-fitting – In the standard LDA, each element of α and β is assumed to have the same value – β: distribution of followed users per interest – We set each prior value of β proportional to the number of incoming edges of each followed user
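A sketch of the asymmetric prior idea, reusing the Gibbs sampler above: instead of a scalar beta, each followed user g gets a prior value proportional to its in-degree |e'(g)|. The normalization (keeping the same total prior mass as the scalar case) is an assumption for illustration.

```python
import numpy as np

def asymmetric_beta(in_degree, per_user_mass=0.01):
    """Per-followed-user Dirichlet prior proportional to incoming edge counts.
    in_degree: array of |e'(g)| for each g; per_user_mass is an assumed constant
    chosen so the vector's mean matches a scalar beta of the same value."""
    in_degree = np.asarray(in_degree, dtype=float)
    return per_user_mass * len(in_degree) * in_degree / in_degree.sum()

# In the sampler, the scalar beta would be replaced by this vector:
#   p = (ndk[d] + alpha) * (nkw[:, w] + beta_vec[w]) / (nk + beta_vec.sum())
```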

10 Handling Popular Users Alternative 2: Hierarchical LDA (HLDA) – Generates a topic hierarchy; more frequent topics are located at higher levels – Bottom-level topics are expected to be more specific – Infinite number of topics – Topics are not independent

11 Handling Popular Users Alternative 3: Two-Step Labeling (see the sketch below) – Topic establishment step: run LDA after removing popular users from the dataset, similar to how stop words are removed before applying LDA to a document corpus – Labeling step: apply LDA only to the popular users in the dataset
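A sketch of the two-step flow; run_lda and label_against_fixed_topics are hypothetical helper names standing in for the topic establishment and labeling steps described above.

```python
def two_step_labeling(edges, in_degree, V=100):
    """Step 1: establish topics without popular users; step 2: label popular users.
    run_lda and label_against_fixed_topics are hypothetical helpers."""
    popular = {g for g, deg in in_degree.items() if deg > V}
    core = [(f, g) for f, g in edges if g not in popular]  # stop-word analogy
    topics = run_lda(core)                                 # topic establishment
    rest = [(f, g) for f, g in edges if g in popular]
    return label_against_fixed_topics(rest, topics)        # labeling step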

12 Handling Popular Users Alternative 4: Threshold Noise Filtering – Even the smallest number of times a popular user is assigned to a topic group can outnumber the largest number of times a non-popular user is assigned to that topic group – Set a cut-off value to determine whether to label a user with each topic – Filter out less relevant topics by keeping only the top-K topics for each user
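A minimal sketch of the filtering rule: keep only topics whose share of a user's assignments exceeds a cutoff C, optionally keeping just the top-K of those (C = 0.05 matches the experiment table below; the function name is an assumption).

```python
def filter_topics(topic_counts, C=0.05, top_k=None):
    """topic_counts: {topic_id: times the user was assigned to it}. Keep topics
    whose share exceeds C; optionally keep only the top_k of those."""
    total = sum(topic_counts.values())
    kept = {t: c for t, c in topic_counts.items() if c / total > C}
    if top_k is not None:
        kept = dict(sorted(kept.items(), key=lambda kv: -kv[1])[:top_k])
    return kept
```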

13 Experiments Dataset: Twitter, between October 2009 and January 2010; sampled 10 million edges. Statistics:
– |E| = 10,000,000
– |U| = 2,430,237
– |F| = 14,015
– |G| = 2,427,373
– max(|e(f)|) = 142,669 (f: zappos)
– max(|e'(g)|) = 7,410 (g: barackobama)

14 Experiments 100 topic groups; boundary value V = 100:
– Normal user group: |e'(g)| ≤ V
– Popular user group: |e'(g)| > V
Experimental cases:
– base: LDA over the whole dataset
– non-popular: LDA over the normal user group dataset
– beta: LDA with asymmetric priors
– hlda-2lv: HLDA with 2 levels
– hlda-3lv: HLDA with 3 levels
– 2step: two-step labeling
– filter-base: threshold noise filtering after base (C = 0.05)
– filter-2step: threshold noise filtering after 2step (C = 0.05)

15 Experiments Measurement 1: Perplexity – Measures how well the trained model deals with unobserved test data – A perplexity of k means that you are, on average, as surprised as if you had to guess between k equiprobable choices at each step – The smaller the perplexity, the better the model
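For reference, the standard perplexity definition from the LDA literature (Blei et al., 2003), where w_d is the word sequence of held-out document d and N_d its length:

```latex
\mathrm{perplexity}(D_{\text{test}}) =
  \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}
                      {\sum_{d=1}^{M} N_d} \right)
```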

16 Experiments Measurement 1: Perplexity – HLDA shows significantly worse perplexity than the other approaches

17 Experiments Measurement 2: Quality – Determined by how users perceive the quality of the identified topic groups from each approach – Conducted via a survey with a total of 14 participants

18 Experiments Measurement 3: Qualitative Comparison – Examine the top ten users in a topic group – By going over the users' bios, we see that many of the users in the group share the same interest

19 Experiments Measurement 3: Qualitative Comparison – base: the topic group "cycle" from base shows the popular user issue

20 Experiments Measurement 3: Qualitative Comparison – base: Barack Obama appears in 15 out of the 100 topic groups produced by base; among the 15 groups, only one is related to his specialty, politics. The standard LDA thus suffers from the noise from popular users if applied directly to our social graph dataset

21 Experiments Measurement 3: Qualitative Comparison – non-popular: the topic group "mobile gadget blog" from non-popular

22 Experiments Measurement 3: Qualitative Comparison – 2step: the topic group corresponding to "mobile gadget blog"; many users in this group are very popular tech media

23 Experiments Measurement 3: Qualitative Comparison – filter-2step: the topic group corresponding to "mobile gadget blog"; many users in this group are very popular tech media, but cnnbrk and breakingnews are removed

24 Conclusion Two-step labeling with threshold noise filtering is very effective in handling the popular user issue, showing a 1.64x improvement in the quality of the produced topic groups compared to the standard LDA model

25 Web Event Topic Analysis Goals: – Analyze the common topic of many different web events – Cluster events that belong to the same topic – Choose suitable topic terms for the clusters

26 Web Event Topic Analysis Dimension LDA (DLDA): – Represents an event as a multi-dimensional vector – {agent, activity, object, time, location, cause, purpose, manner} Combined with K-means clustering
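To make the eight-dimension representation concrete, a small sketch of one event; the example values are made up for illustration.

```python
# Hypothetical event in the eight-dimension representation used by DLDA.
event = {
    "agent":    ["government"],
    "activity": ["announce"],
    "object":   ["policy"],
    "time":     ["2011-03"],
    "location": ["beijing"],
    "cause":    ["inflation"],
    "purpose":  ["stability"],
    "manner":   ["press conference"],
}
```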

27 Web Event Topic Analysis

28 Dimension LDA (DLDA) – Performs LDA analysis – Adds a parameter vector for the weighting of each dimension – Selects some dimensions as topic feature dimensions

29 Topic Feature Clustering Cluster the content of the topic feature dimensions to analyze the common topic across events
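A minimal sketch of clustering events by the text of their topic feature dimensions, assuming scikit-learn is available and TF-IDF features over the selected dimensions (both are illustrative choices, not specified by the slides).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical topic-feature text per event (selected dimensions concatenated).
events = [
    "government announce policy",
    "policy reform announce",
    "earthquake rescue coast",
]
X = TfidfVectorizer().fit_transform(events)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # events sharing a common topic land in the same cluster
```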

30 Topic Feature Words Selection – Use the DLDA model to select topic feature words – Find the words within the same cluster group – by maximizing the probability of a topic given the dimension and the word
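One way to read the selection rule as code: for each cluster, keep the words that maximize p(topic | dimension, word). The probability table p_t_given_dw is a hypothetical DLDA output, not a documented interface.

```python
import numpy as np

def select_feature_words(p_t_given_dw, words, topic, top_n=5):
    """p_t_given_dw: array of shape (topics, words) holding p(topic | dimension,
    word) for one topic feature dimension (hypothetical DLDA output). Returns
    the top_n words maximizing the probability of the given topic."""
    scores = p_t_given_dw[topic]
    order = np.argsort(scores)[::-1][:top_n]
    return [words[i] for i in order]
```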

31 Topic Terms Generating – Merge two words together – Replace with existing keywords from a dictionary => candidate topic terms – Finally, compute the probability distribution over the candidate topic terms

32 Results Using topic feature clustering with the extended LDA model is better than using clustering or the LDA model alone

