Collective Network Linkage across Heterogeneous Social Platforms


1 Collective Network Linkage across Heterogeneous Social Platforms
International Conference on Data Mining, Atlantic City, NJ, USA
Ming Gao, Institute for Data Science and Engineering, East China Normal University, Shanghai, China

2 Our Co-authors
Ee-Peng Lim, David Lo, Feida Zhu, Philips Kokoh Prasetyo (Singapore Management University)
Aoying Zhou (East China Normal University)

3 Roadmap
Background
Related Work
Solution
Empirical Study
Conclusions

4 Background
Social media sites have become extremely popular in recent years.
People maintain accounts and social connections on several social media sites simultaneously.
Major applications of network linkage:
Across different social networks: profile users, understand user behavior, recommend users or products across networks.
On a single social network: detect duplicate accounts within that network.

5 Related Work
Network linkage across different social networks:
No existing approach is unsupervised, domain-independent, and also able to handle missing and incomplete data.
These gaps are the focus of this work.

6 Network Linkage vs. Record Linkage
Aspect | Network linkage | Record linkage
Object | Social user | Relational data
Attribute | Heterogeneous; implicit and explicit; unfixed | Simple; explicit; fixed
Missing data | Many | Few
Relationship | True | False
Challenges:
Heterogeneous attributes
Noise or missing data in user attributes
Social connections across heterogeneous networks
Many pairs for consideration

7 Formulation
Two networks: A and B.
The set of candidate pairs: R, made up of M (matched pairs) and U (unmatched pairs).
For each candidate pair, a comparison vector holds the values of a set of similarity functions computed between the observed attributes of the two users.
Our task is to determine M.
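As a rough illustration of the comparison vector described on this slide, the sketch below builds one vector per candidate pair from per-attribute similarity functions. The attribute names (`username`, `location`), the choice of similarity functions, and the use of `None` for a missing value are all assumptions made for the example, not details taken from the talk.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Similarity in [0, 1] between two strings; None if either is missing."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a, b):
    """1.0 for identical values, 0.0 otherwise; None if either is missing."""
    if a is None or b is None:
        return None
    return 1.0 if a == b else 0.0


# Hypothetical attribute schema: one similarity function per attribute.
ATTRIBUTES = {"username": string_sim, "location": exact_sim}


def comparison_vector(user_a, user_b):
    """Apply every similarity function to one candidate pair of user profiles."""
    return [fn(user_a.get(name), user_b.get(name))
            for name, fn in ATTRIBUTES.items()]
```

For example, `comparison_vector({"username": "alice", "location": "NJ"}, {"username": "Alice"})` yields `[1.0, None]`: the usernames agree after lowercasing, and the location similarity is missing because one profile lacks that attribute.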

8 Overview of Solution
Collective network linkage approach (CNL):
Is an unsupervised, probabilistic approach
Integrates heterogeneous attributes
Handles missing data
Evaluates social similarity in a collective manner
Can scale up to large networks using LSH
Solution: given a pair of users and its similarity vector, assign a label to the pair according to the pair's score (the notation and the score formula are missing from this transcript).

9 Empirical Study
Tasks:
Self-linking for users from Twitter
Linking users across Foursquare and Twitter
Datasets:
Twitter: TWN(x%), with size N and noise x%
Foursquare
Ground truth: 3,534 matched pairs
Our approach vs. baselines:
CNL._.: CNLF-E, CNLnonN-E, CNLnonN-G, CNLLF-E
NL._.: NLnonN-E, NLnonN-G, NLF-E
Mobius

10 Matching Score for Self-linking on TW1109(0%)
[Figure panels: NLnonN-E, NLnonN-G, CNLnonN-E, CNLnonN-G]
Distribution assignment is very important.
The scores from CNL with the correct distribution assignment are more distinguishable than those of NL.

11 Comparison with Mobius on TWN(10%)
[Plot: precision]
CNLF-E outperforms Mobius significantly.

12 Scalability Test for Self-linking on TWN(10%)
[Plots: candidate pairs (annotated "Less than 1%"), elapsed time (sec.), precision, recall]
CNLLF-E can scale up to large networks.

13 Linking Heterogeneous Large Networks
[Plots: precision, recall]

14 DEMO: Linky http://research.larc.smu.edu.sg/linky/
Linky: Linking networks for unity
Two networks: Foursquare, Twitter
Four features: username, social structure, temporal features, content features

15 Conclusions
Network linkage across heterogeneous social networks:
A unified and unsupervised approach
Integrates heterogeneous user attributes and social connections
Handles missing data
Scales up to large social networks
Future work:
A distributed solution to improve scalability
Multiple networks, rather than two networks

16 Thank You for Your Attention

17 Integrates the Heterogeneous Attributes
Attribute similarities can take discrete or continuous values.
The exponential family is a set of PDFs or PMFs.
Each attribute similarity is drawn from a distribution in the exponential family.
Log-likelihood (formula missing from this transcript); its annotations were: parameters for the matched and unmatched groups; a 2-dimensional latent vector; values that may be missing; Pr(r \in M).
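The formula on this slide did not survive transcription. As a hedged reference point only, the block below writes out the standard complete-data log-likelihood of a two-group mixture with exponential-family components, which matches the annotations on the slide. All symbols here are assumed notation, not the authors': $\gamma_r$ is the comparison vector of pair $r$, $(l_{r,M}, l_{r,U})$ is a 2-dimensional latent indicator of the pair's group, $p = \Pr(r \in M)$, and $\theta_M$, $\theta_U$ are the parameters of the matched and unmatched groups. Attributes are assumed conditionally independent given the group.

```latex
\ell(p, \theta_M, \theta_U)
  = \sum_{r \in R} \Big[
      l_{r,M} \log\big( p \, f(\gamma_r; \theta_M) \big)
    + l_{r,U} \log\big( (1 - p) \, f(\gamma_r; \theta_U) \big)
    \Big],
\qquad
f(\gamma_r; \theta) = \prod_{k} f_k(\gamma_{r,k}; \theta_k),
\qquad
f_k(x; \theta_k) = h_k(x) \exp\big( \eta_k(\theta_k)^{\top} T_k(x) - A_k(\theta_k) \big).
```

Under the conditional-independence assumption, a missing $\gamma_{r,k}$ simply contributes no factor to the product for that pair.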

18 Handles the Missing Values
CNL works in an unsupervised manner and handles missing data.
It employs the EM algorithm to estimate the parameters.
In the E-step, it replaces latent variables and missing values with their expectations.
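To make the EM idea concrete, here is a minimal sketch of EM for a two-group (matched / unmatched) mixture over similarity vectors with missing entries. It is not the authors' implementation: it assumes a Gaussian for every attribute (one member of the exponential family), assumes attributes are conditionally independent, and skips missing entries (marked `None`) rather than imputing their expectations explicitly. The function name `em_match` and its parameters are hypothetical.

```python
import math


def em_match(X, n_iter=50, p=0.1):
    """X: list of similarity vectors, None marks a missing value.
    Returns an estimate of Pr(pair in M) for every pair."""
    d = len(X[0])
    cols = [[row[j] for row in X if row[j] is not None] for j in range(d)]
    # Group 0 = matched (initialised at high similarity), group 1 = unmatched.
    mu = [[max(c) for c in cols], [min(c) for c in cols]]
    var = [[0.05] * d, [0.05] * d]
    post = [p] * len(X)
    for _ in range(n_iter):
        # E-step: posterior of the match label, using observed attributes only.
        for i, row in enumerate(X):
            logw = [math.log(p), math.log(1 - p)]
            for g in (0, 1):
                for j, x in enumerate(row):
                    if x is not None:
                        logw[g] += -0.5 * (math.log(2 * math.pi * var[g][j])
                                           + (x - mu[g][j]) ** 2 / var[g][j])
            m = max(logw)
            e0, e1 = math.exp(logw[0] - m), math.exp(logw[1] - m)
            post[i] = e0 / (e0 + e1)
        # M-step: re-estimate the prior and the per-group means and variances.
        p = min(max(sum(post) / len(post), 1e-6), 1 - 1e-6)
        for g in (0, 1):
            for j in range(d):
                obs = [((w if g == 0 else 1 - w), row[j])
                       for w, row in zip(post, X) if row[j] is not None]
                total = sum(w for w, _ in obs) + 1e-12
                mu[g][j] = sum(w * x for w, x in obs) / total
                var[g][j] = (sum(w * (x - mu[g][j]) ** 2 for w, x in obs) / total
                             + 1e-3)  # variance floor keeps the density finite
    return post
```

Pairs whose returned posterior exceeds a chosen threshold would be labelled as matched; the threshold itself is a design choice this sketch leaves open.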

19 Scaling Up to Large Networks
LSH is used to speed up the computation.
LSH on usernames can be used to block users, so that only users falling in the same block are compared.
It also reduces the cost of computing social similarity.
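The blocking step can be sketched with one common LSH scheme: MinHash over character bigrams of the username, followed by banding. Which LSH family the talk actually uses is not stated in the transcript, so the scheme, the function names, and the parameters (`bands`, `rows`, bigram shingles) are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(name, k=2):
    """Set of lowercase character k-grams of a username."""
    name = name.lower()
    return {name[i:i + k] for i in range(max(1, len(name) - k + 1))}


def minhash(name, num_hashes):
    """MinHash signature: for each seeded hash, the minimum over all shingles."""
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(name))
        for seed in range(num_hashes)
    )


def candidate_pairs(users_a, users_b, bands=8, rows=2):
    """Return cross-network pairs whose signatures agree on at least one band."""
    buckets = defaultdict(lambda: (set(), set()))
    for side, users in enumerate((users_a, users_b)):
        for u in users:
            sig = minhash(u, bands * rows)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key][side].add(u)
    pairs = set()
    for in_a, in_b in buckets.values():
        pairs.update((a, b) for a in in_a for b in in_b)
    return pairs
```

Only the returned pairs need a full comparison vector. More bands with fewer rows per band admit more candidate pairs (higher recall, more work); fewer bands with more rows do the opposite.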

20 Links Networks in a Collective Manner
CNL works in a collective manner and consists of two stages:
In the first iteration, only non-social attributes are used to link users.
Based on the result of the first iteration, it then links users by also integrating social similarity.
CNL terminates when the convergence condition is met.
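The two-stage loop can be sketched as follows. This is a deliberately simplified stand-in: the talk treats social similarity as part of the probabilistic model, whereas this example just adds a weighted social term to a fixed non-social score. The function name, the additive scoring rule, `threshold`, and `social_weight` are all hypothetical choices made for the illustration.

```python
def collective_link(pairs, base_score, neighbors_a, neighbors_b,
                    threshold=0.5, social_weight=0.3, max_iter=10):
    """pairs: candidate (a, b) pairs; base_score: non-social score per pair;
    neighbors_a / neighbors_b: dict mapping a user to the set of its friends."""
    # Stage 1: link using non-social attributes only.
    matched = {pr for pr in pairs if base_score[pr] >= threshold}
    # Stage 2: repeatedly add social similarity derived from the current links.
    for _ in range(max_iter):
        new_matched = set()
        for a, b in pairs:
            na = neighbors_a.get(a, set())
            nb = neighbors_b.get(b, set())
            # Count friend pairs that are themselves currently linked.
            linked = sum(1 for x in na for y in nb if (x, y) in matched)
            social = min(1.0, linked / max(1, min(len(na), len(nb))))
            if base_score[(a, b)] + social_weight * social >= threshold:
                new_matched.add((a, b))
        if new_matched == matched:  # convergence: the link set stopped changing
            break
        matched = new_matched
    return matched
```

Because the social term only ever adds to the score, the set of links can only grow between iterations, so the loop stops after a bounded number of rounds. A pair that is ambiguous on its own attributes can still be linked once enough of its friends have been linked.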

