Collective Network Linkage across Heterogeneous Social Platforms International Conference on Data Mining, Atlantic City, NJ, USA Ming Gao Institute for Data Science and Engineering East China Normal University, Shanghai, China
Our Co-authors: Ee-Peng Lim, David Lo, Feida Zhu, Philips Kokoh Prasetyo (Singapore Management University); Aoying Zhou (East China Normal University)
Roadmap: Background; Related Work; Solution; Empirical Study; Conclusions
Background
Social media sites have become extremely popular in recent years.
People maintain accounts and social connections on several social media sites simultaneously.
Major applications of network linkage:
Across different social networks: profile users; understand user behavior; recommend users or products across networks; …
On a single social network: detect duplicate accounts within the network.
Related Work
Network linkage across different social networks: no existing unsupervised approach is domain-independent and also able to handle missing and incomplete data. These gaps are the focus of this work.
Network Linkage vs. Record Linkage
                Network linkage          Record linkage
Object          Social user              Relational data
Attribute       Heterogeneous            Simple
                Implicit and explicit    Explicit
                Unfixed                  Fixed
Missing data    Many                     Few
Relationship    True                     False
Challenges: heterogeneous attributes; noise or missing data in user attributes; social connections across heterogeneous networks; many candidate pairs to consider.
Formulation
Two networks: A and B.
The set of candidate pairs: R. M: matched pairs; U: unmatched pairs.
Each candidate pair has a comparison vector, which holds the values of a set of similarity functions computed between the observed attributes of the two users.
Our task is to determine M.
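To make the idea of a comparison vector concrete, here is a minimal sketch. The attribute names (`username`, `location`, `tags`) and the choice of similarity functions are assumptions made for illustration only; the slides do not say which functions CNL uses. A missing attribute on either side yields `None` rather than a similarity of zero.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """String similarity in [0, 1]; None if either value is missing."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Jaccard similarity of two token sets; None if either is missing or both are empty."""
    if a is None or b is None:
        return None
    a, b = set(a), set(b)
    if not a and not b:
        return None
    return len(a & b) / len(a | b)


def comparison_vector(u, v):
    """Similarity of each (hypothetical) attribute for one candidate pair."""
    return [
        string_sim(u.get("username"), v.get("username")),
        string_sim(u.get("location"), v.get("location")),
        jaccard(u.get("tags"), v.get("tags")),
    ]


# One user from network A, one from network B; A's location is unobserved.
user_a = {"username": "alice", "tags": {"x", "y"}}
user_b = {"username": "alice", "location": "NJ", "tags": {"y", "z"}}
vec = comparison_vector(user_a, user_b)
```

Keeping missing values explicit lets a later estimation step decide how to treat them, instead of confusing "unobserved" with "dissimilar".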
Overview of Solution
The collective network linkage approach (CNL):
is an unsupervised, probabilistic approach;
integrates heterogeneous attributes;
handles missing data;
evaluates social similarity in a collective manner;
can scale up to large networks using LSH.
Solution: given a pair of users with its similarity vector, assign a label to the pair according to its score [equation: score of the pair].
Empirical Study
Tasks: self-linking of Twitter users; linking users across Foursquare and Twitter.
Datasets: Twitter, TWN(x%), with size N and noise x%; Foursquare. Ground truth: 3,534 matched pairs.
Our approach vs. baselines: CNL._.: CNLF-E, CNLnonN-E, CNLnonN-G, CNLLF-E; NL._.: NLnonN-E, NLnonN-G, NLF-E; Mobius.
Matching Score for Self-linking on TW1109(0%)
[Figure: matching scores; panels NLnonN-E, NLnonN-G, CNLnonN-E, CNLnonN-G]
Distribution assignment is very important.
With a correct distribution assignment, the scores from CNL are more distinguishable than those of NL.
Comparison with Mobius on TWN(10%)
[Figure: precision]
CNLF-E significantly outperforms Mobius.
Scalability Test for Self-linking on TWN(10%)
[Figures: candidate pairs (annotated "Less than 1%"); elapsed time (sec.); precision; recall]
CNLLF-E can scale up to large networks.
Linking Heterogeneous Large Networks
[Figures: precision; recall]
DEMO: Linky
http://research.larc.smu.edu.sg/linky/
Linky: Linking networks for unity
Two networks: Foursquare, Twitter
Four features: username, social structure, temporal features, content features
Conclusions
Network linkage across heterogeneous social networks with a unified and unsupervised approach that:
integrates heterogeneous user attributes and social connections;
handles missing data;
scales up to large social networks.
Future work: a distributed solution to improve scalability; linking multiple networks rather than two.
Thank You for Your Attention
Integrates the Heterogeneous Attributes
Attribute similarities can take discrete or continuous values.
The exponential family is a set of PDFs or PMFs.
Each attribute similarity is drawn from a distribution in the exponential family.
[Equation: log-likelihood; annotated terms: parameters for the matched and unmatched groups, 2-dim. latent vector, values that may be missing, Pr(r \in M)]
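As a point of reference, a generic two-group mixture log-likelihood of this kind can be written as below. This is a textbook-style sketch, not the formula from the slide: the symbols \gamma_r^{(j)} (similarity of attribute j for pair r), l_r (latent match indicator), O_r (attributes observed for pair r), p, and \theta are assumed notation, and conditional independence of attributes within each group is an added assumption.

```latex
% Exponential-family density or mass function for attribute j (assumed notation)
\[
f_j(x;\theta) = h_j(x)\,\exp\!\big(\eta_j(\theta)^{\top} T_j(x) - A_j(\theta)\big)
\]

% Complete-data log-likelihood over candidate pairs, with p = Pr(r \in M)
\[
\ell(\Theta) = \sum_{r \in R} \Big[
    l_r \Big( \log p + \sum_{j \in O_r} \log f_j\big(\gamma_r^{(j)};\theta_M^{(j)}\big) \Big)
  + (1 - l_r) \Big( \log(1-p) + \sum_{j \in O_r} \log f_j\big(\gamma_r^{(j)};\theta_U^{(j)}\big) \Big)
\Big]
\]
```

Here the pair (l_r, 1 - l_r) plays the role of a 2-dimensional latent vector, and summing only over O_r is one simple way to accommodate missing attribute similarities.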
Handles the Missing Values
CNL works in an unsupervised manner and handles missing data.
It employs the EM algorithm to estimate the parameters.
In the E-step, it replaces latent variables and missing values with their expectations.
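The following is a minimal sketch of EM for a two-group (matched / unmatched) mixture, not the authors' implementation. It assumes independent Gaussian attribute similarities and marginalizes missing entries (marked `None`) out of the likelihood, which is a simplification of the expectation-based treatment described above. The function name, initial values, and variance floor are all hypothetical choices.

```python
import math


def normal_logpdf(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def em_two_groups(X, n_iter=50):
    """X: list of similarity vectors, None marks a missing value.
    Returns (scores, params); scores[i] estimates Pr(pair i is matched)."""
    d = len(X[0])
    mu = [[0.8] * d, [0.2] * d]      # group 0 = matched, group 1 = unmatched
    var = [[0.05] * d, [0.05] * d]   # assumed starting variances
    prior = 0.5                      # initial Pr(r in M)
    post = [0.5] * len(X)
    for _ in range(n_iter):
        # E-step: posterior of the latent label, using observed attributes only
        for i, x in enumerate(X):
            log_m, log_u = math.log(prior), math.log(1.0 - prior)
            for j, v in enumerate(x):
                if v is None:
                    continue
                log_m += normal_logpdf(v, mu[0][j], var[0][j])
                log_u += normal_logpdf(v, mu[1][j], var[1][j])
            top = max(log_m, log_u)
            e_m, e_u = math.exp(log_m - top), math.exp(log_u - top)
            post[i] = e_m / (e_m + e_u)
        # M-step: weighted mean and variance over observed entries
        prior = min(max(sum(post) / len(post), 1e-6), 1 - 1e-6)
        for k in range(2):
            for j in range(d):
                obs = [((p if k == 0 else 1.0 - p), x[j])
                       for p, x in zip(post, X) if x[j] is not None]
                wsum = sum(w for w, _ in obs) + 1e-12
                mu[k][j] = sum(w * v for w, v in obs) / wsum
                var[k][j] = sum(w * (v - mu[k][j]) ** 2 for w, v in obs) / wsum + 1e-4
    return post, {"mu": mu, "var": var, "prior": prior}
```

A pair would then be labeled as matched when its score exceeds a chosen threshold, for example 0.5; swapping the Gaussian for another exponential-family distribution only changes `normal_logpdf` and the M-step updates.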
Scale-up to Large Networks
Computation is sped up using LSH (locality-sensitive hashing).
LSH on usernames is used to block users, so that only users falling into the same block are compared.
It also reduces the computation of social similarity.
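One common way to realize LSH blocking on usernames is MinHash over character bigrams with banding; the slides do not say which LSH family CNL uses, so the sketch below is an assumption, and the function names and the `bands` / `rows` parameters are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(name, n=2):
    """Character n-grams of a lowercased username."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}


def minhash(tokens, num_hashes):
    """MinHash signature: for each seeded hash, the minimum over all tokens."""
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_hashes)
    )


def lsh_candidates(names_a, names_b, bands=12, rows=2):
    """Return cross-network username pairs that share at least one LSH bucket."""
    buckets = defaultdict(lambda: (set(), set()))
    for side, names in enumerate((names_a, names_b)):
        for name in names:
            sig = minhash(shingles(name), bands * rows)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key][side].add(name)
    pairs = set()
    for left, right in buckets.values():
        pairs.update((a, b) for a in left for b in right)
    return pairs


names_a = ["ming_gao", "davidlo", "feida"]
names_b = ["ming_gao", "qwxyz99", "davidlo"]
candidates = lsh_candidates(names_a, names_b)
```

Only the pairs in `candidates` need a full comparison vector, which is where the saving over comparing every user in A with every user in B comes from. More bands make the blocking more permissive; more rows per band make it stricter.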
Links Networks in a Collective Manner
CNL works in a collective manner and consists of two stages:
In the first iteration, only non-social attributes are used to link users.
Based on the result of the first iteration, it then links users by integrating social similarity.
CNL terminates when the convergence condition holds.
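The two stages can be sketched as a simple fixed-point loop. This is an illustration under stated assumptions, not CNL itself: it replaces the probabilistic model with a weighted average of a precomputed non-social score and a social similarity, and the names `base_score`, `weight`, and `threshold` are hypothetical. Convergence here means the matched set stops changing.

```python
def social_similarity(u, v, friends_a, friends_b, matched):
    """Overlap of the two users' neighbourhoods, where a neighbour of u and a
    neighbour of v count as shared if that pair is currently matched."""
    fa, fb = friends_a.get(u, set()), friends_b.get(v, set())
    if not fa or not fb:
        return None  # social information missing for this pair
    shared_a = sum(1 for x in fa if any((x, y) in matched for y in fb))
    shared_b = sum(1 for y in fb if any((x, y) in matched for x in fa))
    shared = min(shared_a, shared_b)
    return shared / (len(fa) + len(fb) - shared)


def collective_link(candidates, base_score, friends_a, friends_b,
                    threshold=0.5, weight=0.3, max_iter=10):
    # Stage 1: link using non-social attributes only.
    matched = {r for r in candidates if base_score[r] >= threshold}
    # Stage 2: re-score with social similarity until the matched set is stable.
    for _ in range(max_iter):
        new_matched = set()
        for (u, v) in candidates:
            s = social_similarity(u, v, friends_a, friends_b, matched)
            score = base_score[(u, v)]
            if s is not None:
                score = (1 - weight) * score + weight * s
            if score >= threshold:
                new_matched.add((u, v))
        if new_matched == matched:
            break
        matched = new_matched
    return matched


base_score = {("a1", "b1"): 0.9, ("a2", "b2"): 0.9,
              ("a3", "b3"): 0.4, ("a3", "b2"): 0.45}
friends_a = {"a1": {"a3"}, "a2": {"a3"}, "a3": {"a1", "a2"}}
friends_b = {"b1": {"b3"}, "b2": {"b3"}, "b3": {"b1", "b2"}}
result = collective_link(set(base_score), base_score, friends_a, friends_b)
```

In this toy example the pair (a3, b3) is too weak to be linked on non-social attributes alone, but is linked once its neighbours (a1, b1) and (a2, b2) are matched, which is the effect the collective stage is meant to capture.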