Presentation is loading. Please wait.

Presentation is loading. Please wait.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

Similar presentations


Presentation on theme: "Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these."— Presentation transcript:

1 Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.orghttp://www.mmds.org

2 2 Nodes: Football Teams Edges: Games played Can we identify node groups? (communities, modules, clusters) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

3 3 NCAA conferences Nodes: Football Teams Edges: Games played J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

4 4 Can we identify functional modules? Nodes: Proteins Edges: Physical interactions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

5 5 Functional modules Nodes: Proteins Edges: Physical interactions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

6 6 Can we identify social communities? Nodes: Facebook Users Edges: Friendships J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

7 7 High school Summer internship Stanford (Squash) Stanford (Basketball) Social communities Nodes: Facebook Users Edges: Friendships J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

8  Non-overlapping vs. overlapping communities J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org8

9 9 Network Adjacency matrix Nodes J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

10  What is the structure of community overlaps: Edge density in the overlaps is higher! 10 Communities as “tiles” J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

11 11 This is what we want! Communities in a network J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

12  1) Given a model, we generate the network:  2) Given a network, find the “best” model J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org12 C A B D E H F G C A B D E H F G Generative model for networks

13  Goal: Define a model that can generate networks  The model will have a set of “parameters” that we will later want to estimate (and detect communities)  Q: Given a set of nodes, how do communities “generate” edges of the network? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org13 C A B D E H F G Generative model for networks

14  Generative model B(V, C, M, { p c }) for graphs:  Nodes V, Communities C, Memberships M  Each community c has a single probability p c  Later we fit the model to networks to detect communities 14 Model Network Communities, C Nodes, V Model pApA pBpB Memberships, M J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

15 15 Model Network Communities, C Nodes, V Community Affiliations pApA pBpB Memberships, M Think of this as an “OR” function: If at least 1 community says “YES” we create an edge J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

16 16 Model Network J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

17  AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested 17J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

18

19  Detecting communities with AGM: 19 C A B D E H F G 1) Affiliation graph M 2) Number of communities C 3) Parameters p c J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

20 20

21 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org21

22 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org22 00.10 0.04 0.1000.020.06 0.100.0200.06 0.040.06 0 0100 1011 0101 0110 Flip biased coins

23  Given graph G(V,E) and Θ, we calculate likelihood that Θ generated G: P(G|Θ) 00.9 0 0 0 0 00 0 Θ= B(V, C, M, {p c }) 0110 1010 1101 0010 G P(G|Θ ) G 23 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org AB

24 24 P(|) AGM arg max

25 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org25

26 26 u v J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

27 27 j Communities Nodes

28 28 Node community membership strengths J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 01.200.2 0.5000.8 01.810

29 29J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

30 30J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

31 31J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

32  BigCLAM takes 5 minutes for 300k node nets  Other methods take 10 days  Can process networks with 100M edges! 32J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

33

34 34J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

35  Extension: Make community membership edges directed!  Outgoing membership: Nodes “sends” edges  Incoming membership: Node “receives” edges 35J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

36 36

37 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org37

38 38J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

39  Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013. Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach  Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014. Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks  Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013. Community Detection in Networks with Node Attributes J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org39


Download ppt "Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these."

Similar presentations


Ads by Google