Group Formation in Large Social Networks: Membership, Growth, and Evolution Lars Backstrom, Dan Huttenlocher, Joh Kleinberg, Xiangyang Lan Presented by.

Slides:

Advertisements

Similar presentations

Mobile Communication Networks Vahid Mirjalili Department of Mechanical Engineering Department of Biochemistry & Molecular Biology.

Advertisements

Experiments and Variables

LEARNING INFLUENCE PROBABILITIES IN SOCIAL NETWORKS Amit Goyal Francesco Bonchi Laks V. S. Lakshmanan University of British Columbia Yahoo! Research University.

C4.5 algorithm Let the classes be denoted {C1, C2,…, Ck}. There are three possibilities for the content of the set of training samples T in the given node.

Decision Tree Approach in Data Mining

Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα Strong and Weak Ties Chapter 3, from D. Easley and J. Kleinberg book.

Analysis and Modeling of Social Networks Foudalis Ilias.

Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.

(Social) Networks Analysis I

Link creation and profile alignment in the aNobii social network Luca Maria Aiello et al. Social Computing Feb 2014 Hyewon Lim.

UNDERSTANDING VISIBLE AND LATENT INTERACTIONS IN ONLINE SOCIAL NETWORK Presented by: Nisha Ranga Under guidance of : Prof. Augustin Chaintreau.

1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.

Ensemble Learning: An Introduction

Web Projections Learning from Contextual Subgraphs of the Web Jure Leskovec, CMU Susan Dumais, MSR Eric Horvitz, MSR.

Scientific method - 1 Scientific method is a body of techniques for investigating phenomena and acquiring new knowledge, as well as for correcting and.

Computer Science 1 Web as a graph Anna Karpovsky.

Ensemble Learning (2), Tree and Forest

Decision Tree Models in Data Mining

A Measurement-driven Analysis of Information Propagation in the Flickr Social Network WWW09 报告人：徐波.

Models of Influence in Online Social Networks

CS105 Introduction to Social Network Lecture: Yang Mu UMass Boston.

Section 2: Science as a Process

Random Graph Models of Social Networks Paper Authors: M.E. Newman, D.J. Watts, S.H. Strogatz Presentation presented by Jessie Riposo.

Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh.

POSC 202A: Lecture 1 Introductions Syllabus R Homework #1: Get R installed on your laptop; read chapters 1-2 in Daalgard, 1 in Zuur, See syllabus for Moore.

Using Friendship Ties and Family Circles for Link Prediction Elena Zheleva, Lise Getoor, Jennifer Golbeck, Ugur Kuter (SNAKDD 2008)

Topic 13 Network Models Credits: C. Faloutsos and J. Leskovec Tutorial

«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,

Patterns And A Generative Model Jan 24, 2014 Authors: Jianwei Niu, Wanjiun Liao, Jing Peng, Chao Tong Presenter: Guoming Wang Published: Performance Computing.

Social Networking and On-Line Communities: Classification and Research Trends Maria Ioannidou, Eugenia Raptotasiou, Ioannis Anagnostopoulos.

Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.

Online Social Networks Daniel Huttenlocher John P. and Rilla Neafsey Professor of Computing, Information Science and Business.

Free Powerpoint Templates Page 1 Free Powerpoint Templates Influence and Correlation in Social Networks Azad University KurdistanSocial Network.

by B. Zadrozny and C. Elkan

Exploring the dynamics of social networks Aleksandar Tomašević University of Novi Sad, Faculty of Philosophy, Department of Sociology

Using Transactional Information to Predict Link Strength in Online Social Networks Indika Kahanda and Jennifer Neville Purdue University.

To Blog or Not to Blog: Characterizing and Predicting Retention in Community Blogs Imrul Kayes 1, Xiang Zuo 1, Da Wang 2, Jacob Chakareski 3 1 University.

Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science.

Uncovering Overlap Community Structure in Complex Networks using Particle Competition Fabricio A. Liang

An Efficient Algorithm for Enumerating Pseudo Cliques Dec/18/2007 ISAAC, Sendai Takeaki Uno National Institute of Informatics & The Graduate University.

Feedback Effects between Similarity and Social Influence in Online Communities David Crandall, Dan Cosley, Daniel Huttenlocher, Jon Kleinberg, Siddharth.

Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα Networks and Surrounding Contexts Chapter 4, from D. Easley and J. Kleinberg book.

Decision Trees. MS Algorithms Decision Trees The basic idea –creating a series of splits, also called nodes, in the tree. The algorithm adds a node to.

Intel Confidential – Internal Only Co-clustering of biological networks and gene expression data Hanisch et al. This paper appears in: bioinformatics 2002.

Freelib: A Self-sustainable Digital Library for Education Community Ashraf Amrou, Kurt Maly, Mohammad Zubair Computer Science Dept., Old Dominion University.

Introduction to Earth Science Section 2 Section 2: Science as a Process Preview Key Ideas Behavior of Natural Systems Scientific Methods Scientific Measurements.

Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.

Finding Experts Using Social Network Analysis 2007 IEEE/WIC/ACM International Conference on Web Intelligence Yupeng Fu, Rongjing Xiang, Yong Wang, Min.

The Scientific Method. Objectives Explain how science is different from other forms of human endeavor. Identify the steps that make up scientific methods.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Fuzzy integration of structure adaptive SOMs for web content.

Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.

Most of contents are provided by the website Network Models TJTSD66: Advanced Topics in Social Media (Social.

IMPORTANCE OF STATISTICS MR.CHITHRAVEL.V ASST.PROFESSOR ACN.

DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.

HCC class lecture 21: Intro to Social Networks John Canny 4/11/05.

11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.

Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.

Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.

Exploring Social Influence via Posterior Effect of Word-of-Mouth Recommendations Junming Huang, Xue-Qi Cheng, Hua-Wei Shen, Tao Zhou, Xiaolong Jin WSDM.

The normal approximation for probability histograms.

Topics In Social Computing (67810) Module 1 Introduction & The Structure of Social Networks.

Using a model of Social Dynamics to Predict Popularity of News Kristina Lerman, Tad Hogg USC Information Sciences Institute, Institute for Molecular Manufacturing.

THE BIBLIOMETRIC INDICATORS. BIBLIOMETRIC INDICATORS COMPARING ‘LIKE TO LIKE’ Productivity And Impact Productivity And Impact Normalization Top Performance.

The simultaneous evolution of author and paper networks

Link Prediction Class Data Mining Technology for Business and Society

Graph clustering to detect network modules

User Joining Behavior in Online Forums

Using Friendship Ties and Family Circles for Link Prediction

The Importance of Communities for Learning to Influence

GhostLink: Latent Network Inference for Influence-aware Recommendation

Presentation transcript:

Group Formation in Large Social Networks: Membership, Growth, and Evolution Lars Backstrom, Dan Huttenlocher, Joh Kleinberg, Xiangyang Lan Presented by Natalia Dragan Data Mining Techniques(CS6/ ) Fall06 Kent State University

Outline Introduction Motivation Present work contributions Related work Membership, Growth, Evolution Conclusions

Introduction Structure of society: people tend to come together and form groups Why it is important to study  To understand better decision-making behavior  Tracking early stages of an epidemic  Following popularity of new ideas and technologies Significant growth in the scale and richness of online communities and social media (MySpace, Face book, LiveJournal, Flickr)

Problems with the studying social processes Easy to build theoretical models (subgraph branching out rapidly over links in the network or collection of small disconnected components growing in a ‘speckled’ fashion) Hard to make concrete empirical statements about these types of processes Lack of reasonable vocabulary for studying group evolution

Present work Principles by which groups develop and evolve in large-scale social networks Crucial point: focusing on networks where the members have explicitly identified themselves as a group or community  (vs. an unsupervised graph clustering problem of inferring “community structures” in a network) 3 main types of questions: Membership, Growth, Change

Membership, Growth, Change Membership  Structural features that influence whether a given individual will join a particular group Growth  Structural features that influence whether a given group will grow significantly over time Change  How focus of interest changes over time  How these changes are correlated with changes in the set of group members

Related work Identifying tightly-connected clusters within a given graph  Dill et al. consider implicitly identified communities For a set of features (ZIP code, particular keyword, etc.) they consider a subgraph of the Web consisting of all pages containing this feature Use of online social networks for data mining  The structure of the communities is not exploited Diffusion on innovation study  Property which is “diffusing” in the present work – membership in a given group

Diffusion of Innovations Diffusion of Innovations is a theory that analyzes, as well as helps explain, the adaptation of a new innovation. In other words it helps to explain the process of social change. An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. The perceived newness of the idea for the individual determines his/her reaction to it (Rogers, 1995). In addition, diffusion is the process by which an innovation is communicated through certain channels over time among the members of a social system. Thus, the four main elements of the theory are the innovation, communication channels, time, and the social system.

Related work on Diffusion of Innovation How a social network evolves as it’s members attributes change (Sarkar and Moore, Holme and Newman) Social network evolution in a university setting (Kossinets and Watts) Evolution of topics over time (Wang and McCallum) Property which is “diffusing” in the present work is a membership in a given group

Road map Introduction Motivation Present work contributions Related work Membership, Growth, Evolution Conclusions

Sources of data LiveJournal  Free on-line community with ~ 10 mln members  300,000 update the content in 24-hour period  Maintaining journals, individual and group blogs  Declaring who are their friends and to which communities they belong DBLP  On-line database of computer science publications (about 400,000 papers)  Friendship network – co-authors in the paper  Conference - community

Community Membership Study of processes by which individuals join communities in a social network Fundamental question about the evolution of communities: who will join in the future? Membership in a community – “behavior” that spreads through the network  Diffusion of innovation study perspective for this question

Dependence on number of friends: start towards membership prediction Underlying premise in diffusion studies: an individual probability of adopting a new behavior increases with the number of friends (K) already engaging in the behavior Theoretical models concentrate on the effect of K, while the structural properties are more influential in determining membership

Approach hypothesis For moderate values of K an individual with K friends in a group is significantly more likely to join if these K fiends are themselves mutual friends than if there are not

Dependence on number of friends st snapshot 2 nd snapshot... C C. K = P(k) = 2/3.. Probability P(k) of joining community = fraction of triples (u,C,k) - user (u), C - community, - friend..

Law of Diminishing returns In economics, diminishing returns is the short form of diminishing marginal returns. In a production system, having fixed and variable inputs, keeping the fixed inputs constant, as more of a variable input is applied, each additional unit of input yields less and less additional output. This concept is also known as the law of increasing opportunity cost or the law of diminishing returns.

Dependence on number of friends: LiveJournal

Dependence of number of friends: DBLP

Discussion of results The plots for LJ and DBLP have similar shapes, dominated by “diminishing returns” property (curve continues increasing, but more and more slowly even for large K) P(2)>2P(1) – benefit of having a second friend is particularly strong (S-shaped behavior) Curve for LJ is quite smooth (1/2 billion triples vs. 7.8 million for DBLP)

A broader range of features Features related to the community C (11)  Number of members (|C|)  Number of individuals with a friend in C (fringe of C)  Number of edges with both ends in the community (|E c |)  etc. Features related to an individual u and her set S of friends in community C (8)  Number of friend in community (|S|)  Number of adjacent pairs in S  Number of pairs in S connected via a path in E c  etc.

Estimating probability on a broader range of features Decision-tree techniques were applied to these features to make advances in estimating the probability of an individual joining a community The technique incorporates  Individual’s position in the network (structural features)  Level of activities among members (group features)

Predictions for LJ and DBLP 1 st snapshot 2 nd snapshot C Fringe C u Data point (u,C) Probability U  C LJ: 17,076,344 data points, 875 communities DBLP: 7,651,013 data points LJ: 14,448 joined community DBLP: 71,618 joined community 20 decisions tree were built for estimation about joining

Splitting process for LJ Each of 875 communities have half of their fringe members included in the training set (with the independent probability 0.5) At each node in the decision tree  Every possible feature  Every binary split treshold for that feature were examined Of all such pairs the split which produces the largest decrease in entropy was chosen

Splitting process for LJ New splits were installed until there are fewer than 100 positive cases at the node Leaf nodes predict the ratio of positive to total cases for that node Averaging technique  For every case the set of decision trees, for which this case is not included in the training set, were built  The average of these predictions is a prediction for the case

Averaging model (Simple description) Selecting a model that explains the data from all the possible models, the one which better fits the data is usually selected. But sometimes there is some model that explains really well the data, creating a model selection uncertainty, which is usually ignored. BMA (Bayesian Model Averaging) provides a coherent mechanism for accounting for this model uncertainty, combining predictions from the different models according to their probability. J. A.and Madigan D. Hoeting and A.E.and Volinsky C.T. Raftery. Bayesian model averaging: A tutorial (With Discussion). Statistical Science, 44(4): , 1999

Averaging model (Simple description) Example: we have an evidence D and 3 possible hypothesis h1, h2 and h3. The posterior probabilities for those hypothesis are P( h1 | D ) = 0.4, P( h2 | D ) = 0.3 and P( h3 | D ) = 0.3  Giving a new observation, h1 classifies it as true and h2 and h3 classify it as false, then the result of the global classifier (BMA) would be calculated as follows:

Top two level splits for predicting single individuals joining communities in LJ

Performance achieved with the decision trees Prediction performance for single individuals joining communities in LJ Prediction performance for single individuals joining communities in DBLP

Internal connectedness of friends Individuals whose friends in community are linked to one another are significantly more likely to join the community

Road map Introduction Motivation Present work contributions Related work Membership, Growth, Evolution Conclusions

Community Growth Prediction task: identifying which communities will grow significantly over a given period of time Binary classification problem Training set Community growth Class 0 (50.6) Class 1 (49.4%) >9 % < 18 % Data set: communities To make predictions: 100 decision trees on 100 independent samples using the community features were built Binary split is installed until a node has less than 50 data points

Top two levels of decision tree splits for predicting community growth in LJ The features and splits varied depending on the sample, but the top 2 splits were stable

Solution to the problem Averaging tree techniques was used Three baselines with a single feature were considered  Size of the community  Number of people in the fringe of the community  Ratio of these two features and combination of all three features

Results Predicting community growth: baselines based on three different features, and performance using all features By including the full set of features predictions with reasonably good performance were received

Road map Introduction Motivation Present work contributions Related work Membership, Growth, Evolution Conclusions

Movement between communities How people and topics move between communities Fundamental question: given a set of overlapping communities  do topics tend to follow people  or do people tend to follow topics Experiment set up: 87 conferences for which there is DBLP data over at least 15-year period  Cumulative set of words in titles is a proxy for top-level topics

Time Series and Detected Bursts: Term Bursts OOPSLA’ “Micro-Pattern Evolution” “Micro-Pattern in Java” T w,C(y) = 2/ y T w,C Term bursts “Micro-Pattern” is hot at OOPSLA in 2003

Time Series and Detected Bursts: Term Bursts T w,C (y) – fraction of paper titles at conference C in year y that contain the word w Bursts in the usage of w  For each time series T w,C is an interval in which T w,C (y) is twice the average rate (“burst rate”)  Burst detection technique exploiting stochastic model for term generation is used Burst intervals serve to identify the “hot topics” (focus of interest at a conference) Word w is hot at conference C in year y if the year y is contained in a burst interval of the time series T w,C

Time Series and Detected Bursts: Movement Bursts Author movement  Authors do not publish every year  Movement is asymmetric Member of a conference C in year y  Has published there in the 5 years leading up to y Author a moves into C from B in year y (B -> C)  a has a paper in conference C in year y and  a is a member of B in year y-1  Property of two conferences and a year B C Smith.. “Micro-Pattern Evolution”

Time Series and Detected Bursts: Movement Bursts B Dill M B,C (y) = 2/ y M B,C Movement bursts Brown C

Time Series and Detected Bursts: Movement Bursts M B,C (y) – fraction of authors at conference C in year y with the property that they are moving into C from B (B -> C) M B,C – time series representing author movement B -> C movement bursts  an interval of y in which the value M B,C (y) exceeds the overall average by an absolute difference of.10 Burst detection is used to find burst intervals

Goal of the Experiment 1 Identify how word burst and movement burst intervals are aligned in time? Word burst intervals identify hot terms Movement burst intervals identify conference pairs B,C during which there was significant movement

Experiment 1: Papers contributing to Movement Bursts Characteristics of papers associated with some movement burst into a conference C  They exhibit different properties from arbitrary papers at C Using of terms currently hot at C Using of terms that will be hot at C in the future Paper at C in y contributes to some movement burst at C  If one of the authors is moving B -> C in y  y is a part of B -> C movement bursts Movement burst.. ICPC’02 OOPSLA’ “Micro-pattern Evolution” Smith

Papers contributing to Movement Bursts Paper uses hot term  If one of the words in its title is hot for the conference and year in which it appears Question: do papers contributing to movement bursts differ from arbitrary papers in the way they use hot terms? Papers contributing to a movement burst contain elevated frequencies of currently and expired hot terms, but lower frequencies of future hot terms A burst of authors moving into C from B are drawn to topics currently hot at C

Experiment 2: Alignment between different conferences Conferences B and C are topically aligned in a year y  If some word is hot at both B and C in year y  Property of two conference and a specific year Hypothesis: two conferences are more likely to be topically aligned in a given year if there is also a movement burst going between them “Micro-pattern” OOPSLA’03 ICSM’03

Results 56.34% of all triples (B,C,y) such that there is B->C movement burst containing year y have the property that B and C are topically aligned in year y 16.2 % of all triples (B,C,y) have the property that B and C are topically aligned in year y The presence of a movement burst between 2 conferences enormously increases the chance they share a hot term

Movement bursts or term bursts come first? There is a B -> C movement burst, and hot terms w such that B and C are topically aligned via w in some year y inside the movement burst 3 events of interest  The start of the burst for w at conference B  The start of the burst for w at conference C  The start of the B -> C movement burst

Four patterns of author movement and topical alignment B -> C movement burst Term burst intervals Shared interest is 50 % more frequent than others Much more frequent for B and C to have a shared burst term that is already underway before the increase in author movement takes place

Conclusions The ways in which communities in social network grow over time were considered  At the level of individuals and their decision to join communities  At a more global level, in which a community can evolve in membership and content

Thank you!