Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

1 Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University. Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site:

2 Can we identify node groups (communities, modules, clusters)? Nodes: Football teams. Edges: Games played. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

3 NCAA conferences. Nodes: Football teams. Edges: Games played.

4 Can we identify functional modules? Nodes: Proteins. Edges: Physical interactions.

5 Functional modules. Nodes: Proteins. Edges: Physical interactions.

6 Can we identify social communities? Nodes: Facebook users. Edges: Friendships.

7 Social communities: High school, Summer internship, Stanford (Squash), Stanford (Basketball). Nodes: Facebook users. Edges: Friendships.

8 Non-overlapping vs. overlapping communities

9 Figure: a network and its adjacency matrix (axes: Nodes).

10 What is the structure of community overlaps? Edge density in the overlaps is higher! Communities as “tiles”.

11 Communities in a network. This is what we want!

12 Generative model for networks: 1) Given a model, we generate the network. 2) Given a network, find the “best” model. Figure: example network on nodes A to H.

13 Goal: Define a model that can generate networks. The model will have a set of “parameters” that we will later want to estimate (and thereby detect communities). Q: Given a set of nodes, how do communities “generate” edges of the network? Figure: generative model for networks, example network on nodes A to H.

14 Generative model B(V, C, M, {p_c}) for graphs: Nodes V, Communities C, Memberships M. Each community c has a single probability p_c. Later we fit the model to networks to detect communities. Figure: Model (communities C with probabilities p_A and p_B, memberships M, nodes V) and the resulting Network.

15 Figure: Model (community affiliations: communities C with probabilities p_A and p_B, memberships M, nodes V) and the resulting Network. Think of this as an “OR” function: if at least one community says “YES”, we create an edge.
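
The “OR” rule on this slide can be sketched in code. This is a minimal illustration, not the authors' implementation. It assumes each community c flips an independent biased coin with probability p_c for every pair of its members, so two nodes are linked with probability 1 − ∏(1 − p_c), where the product runs over the communities they share. The function names, the `memberships` dictionary layout, and the example values are hypothetical.

```python
import itertools
import random


def edge_probability(u, v, memberships, p):
    """Probability that at least one shared community says "YES" for (u, v)."""
    no_edge = 1.0
    for c in memberships[u] & memberships[v]:
        no_edge *= 1.0 - p[c]  # community c says "NO"
    return 1.0 - no_edge


def generate_agm(memberships, p, seed=None):
    """Sample an undirected edge set from memberships M and probabilities {p_c}."""
    rng = random.Random(seed)
    edges = set()
    for u, v in itertools.combinations(sorted(memberships), 2):
        shared = sorted(memberships[u] & memberships[v])
        # OR over shared communities: one successful coin flip creates the edge.
        if any(rng.random() < p[c] for c in shared):
            edges.add((u, v))
    return edges


# Hypothetical example: two communities that overlap in node C.
memberships = {"A": {"c1"}, "B": {"c1"}, "C": {"c1", "c2"},
               "D": {"c2"}, "E": {"c2"}}
p = {"c1": 0.8, "c2": 0.6}
edges = generate_agm(memberships, p, seed=0)
print(sorted(edges))
```

Under this assumption, nodes that share no community (such as A and D here) are never linked, and nodes that share several communities are linked more often, which matches the higher edge density in overlaps noted earlier.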

16 Figure: Model and the Network it generates.

17 AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

19 Detecting communities with AGM: 1) Affiliation graph M, 2) Number of communities C, 3) Parameters p_c. Figure: example network on nodes A to H.

22 Flip biased coins

23 Given graph G(V, E) and Θ, we calculate the likelihood that Θ generated G: P(G|Θ), where Θ = B(V, C, M, {p_c}).
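
The likelihood on this slide can be computed as a product over node pairs. The sketch below assumes that edges are generated independently per pair and that the pair probability is P(u,v) = 1 − ∏(1 − p_c) over shared communities, so that P(G|Θ) = ∏ over edges of P(u,v) times ∏ over non-edges of (1 − P(u,v)). It works in log space to avoid underflow. The function names and the data layout are hypothetical.

```python
import itertools
import math


def edge_probability(u, v, memberships, p):
    """P(u, v): at least one shared community creates the edge."""
    no_edge = 1.0
    for c in memberships[u] & memberships[v]:
        no_edge *= 1.0 - p[c]
    return 1.0 - no_edge


def log_likelihood(nodes, edges, memberships, p):
    """log P(G | Theta) for an undirected graph G = (nodes, edges)."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in itertools.combinations(nodes, 2):
        p_uv = edge_probability(u, v, memberships, p)
        # Edges contribute P(u, v); non-edges contribute 1 - P(u, v).
        prob = p_uv if frozenset((u, v)) in edge_set else 1.0 - p_uv
        if prob == 0.0:
            return float("-inf")  # the model cannot produce this graph
        total += math.log(prob)
    return total


# Hypothetical example: A and B share community c1, C is alone in c2.
nodes = ["A", "B", "C"]
memberships = {"A": {"c1"}, "B": {"c1"}, "C": {"c2"}}
p = {"c1": 0.5, "c2": 0.5}
ll = log_likelihood(nodes, [("A", "B")], memberships, p)
```

A graph containing an edge between nodes with no shared community gets log-likelihood −∞ under this simplified model, because the model as sketched here has no other way to create that edge.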

24 AGM fitting: $\arg\max_{\Theta} P(G \mid \Theta)$

27 Figure: Nodes and Communities (community index j).

28 Node community membership strengths
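
How membership strengths can turn into edge probabilities is sketched below. This assumes the BigCLAM formulation from the Yang and Leskovec paper listed on the final slide, where each node u has a nonnegative strength vector F_u (one entry per community) and two nodes are linked with probability 1 − exp(−F_u · F_v). The function name and the example strengths are hypothetical.

```python
import math


def link_probability(f_u, f_v):
    """1 - exp(-F_u . F_v) for two nonnegative membership-strength vectors."""
    dot = sum(a * b for a, b in zip(f_u, f_v))
    return 1.0 - math.exp(-dot)


# Hypothetical strengths for three nodes over two communities.
F = {
    "u": [1.2, 0.0],  # strong member of community 0 only
    "v": [0.5, 0.3],  # weaker member of both communities
    "w": [0.0, 2.0],  # strong member of community 1 only
}

p_uv = link_probability(F["u"], F["v"])  # share community 0
p_uw = link_probability(F["u"], F["w"])  # share no community
```

With this form, a zero strength means "not a member", and nodes with no community in common have link probability zero, while larger shared strengths push the probability toward one.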

32 BigCLAM takes 5 minutes for 300k-node networks. Other methods take 10 days. Can process networks with 100M edges!

35 Extension: Make community membership edges directed! Outgoing membership: Node “sends” edges. Incoming membership: Node “receives” edges.

39 Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach, by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM). Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks, by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM). Community Detection in Networks with Node Attributes, by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference on Data Mining (ICDM).

