Large-Scale Cost-sensitive Online Social Network Profile Linkage.

1 Large-Scale Cost-sensitive Online Social Network Profile Linkage

2 Background & Motivation: Footprints in different social networks. User identification in social analysis. Privacy & security. Commercial & government applications.

3 Outline: Problem definition; Related work; Approach; Experiment; Conclusion & future work

4 Problem Definition. Terminology: Identity: a person. Profile/User: your footprint on social media. Profile Linkage: linking your footprints together. Input & Output: Input: profiles of one site as QUERY and profiles of the other site as TARGET. Output: all profile pairs classified as matched.

5 Characteristics of profile. Name (semi-structured vs. structured): structured, {“given name”: “haochen”, “family name”: “zhang”}; semi-structured, name: zhang haochen. Semi-structured schema. Incompleteness & missing attributes. Privacy policy. Virtual identification. Free-text description: Bio, About me, Tags. Multilingualism.

6 Top 5 languages in dataset of Facebook: English, Portuguese, Spanish, Chinese, French. Most frequent tokens in different languages: chris, john, michael; chen, wang, lee; carlos, garcia, daniel; sergey, olga, alexander. About 70% of users are in English. 7.2% of users register under different locales. Transliteration: 昊辰 => Haochen.

7 Feature Acquisition. Network communication costs too much time. Usage limits of web services: 1000 invocations per day for the Google Maps API. High computational complexity compared with string similarity: image processing algorithms.

8 Related work: User linking across social networks; Record linkage and entity resolution; Cost-sensitive feature acquisition

9 Overview of approach. Classification of Potential Links: Features representation, Supervised learning. Cost-sensitive Feature Acquisition. Pruning with Canopy: Parameter tuning, Canopy construction. Entity-based Representation of Profiles: Mapping, Tokenization, Entity extraction.

10 Canopy: design

11 Canopy: efficiency
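The canopy slides carry only their titles in this transcript, so the actual design is not recoverable. As a rough illustration of what canopy-style pruning does in general, the sketch below builds an inverted index over TARGET tokens and keeps, for each QUERY profile, only the targets that share at least `min_shared` tokens. The function name, the token-overlap criterion, and the `min_shared` parameter are assumptions for illustration, not the authors' design.

```python
from collections import defaultdict


def build_canopies(query_profiles, target_profiles, min_shared=1):
    """Map each query id to the set of target ids sharing enough tokens.

    Both arguments are dicts of profile id -> list of tokens.
    Hypothetical helper: a token-overlap simplification of canopy pruning.
    """
    # Inverted index: token -> ids of target profiles containing it.
    index = defaultdict(set)
    for target_id, tokens in target_profiles.items():
        for token in set(tokens):
            index[token].add(target_id)

    canopies = {}
    for query_id, tokens in query_profiles.items():
        shared_counts = defaultdict(int)
        for token in set(tokens):
            for target_id in index.get(token, ()):
                shared_counts[target_id] += 1
        # Only pairs inside a canopy go on to the (costly) classifier.
        canopies[query_id] = {
            t for t, count in shared_counts.items() if count >= min_shared
        }
    return canopies
```

Only the pairs that survive this step need feature computation, which is where the efficiency gain of pruning comes from.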

12 Local Features. Username: Jaro-Winkler similarity. Language: Jaccard similarity. Description, URL: cosine similarity with TF×IDF. Popularity: defined as the number of friends of a user; the following metric is adopted: [formula not captured in the transcript]
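Two of the local features named on this slide can be sketched in a few lines. The code below is a minimal illustration of Jaccard similarity over language sets and cosine similarity over TF×IDF vectors; the tokenization, the smoothed IDF formula (`log(n / df) + 1`), and all function names are assumptions, since the slide does not specify them. Jaro-Winkler is omitted because it is considerably longer.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def idf_table(docs):
    """IDF per token over a list of token lists (assumed smoothing: +1)."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    return {token: math.log(n / count) + 1.0 for token, count in df.items()}


def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity between the TF x IDF vectors of two token lists."""
    vec_a = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_a).items()}
    vec_b = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```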

13 External Features. Geographic Location: values are diverse with different types. Google Maps API: string-represented location => geographic information. Spherical distance between two locations as the feature. Avatar: χ² dissimilarity of the avatars’ gray-scale histograms.
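Once a location string has been geocoded and an avatar reduced to a gray-scale histogram, both external features are short computations. The sketch below uses the haversine formula for spherical distance and a symmetric χ² form with a 0.5 factor on normalized histograms; the slide does not say which χ² variant or Earth radius was used, so both are assumptions, as are the function names. The geocoding call itself is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-square distance between two histograms (0 = identical)."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    if sum_a == 0 or sum_b == 0:
        raise ValueError("histograms must not be empty")
    total = 0.0
    for x, y in zip(hist_a, hist_b):
        p, q = x / sum_a, y / sum_b  # normalize so image size does not matter
        if p + q > 0:
            total += (p - q) ** 2 / (p + q)
    return 0.5 * total
```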

14 Classification: learning. Probabilistic model derived from naïve Bayes. Independent feature assumption.
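The slide's formula is not preserved in the transcript. For orientation, the standard naïve Bayes form under the independent feature assumption can be written as below, where M denotes "the two profiles match" and f_1, ..., f_n are the feature values; this is the textbook form, not necessarily the exact model the authors derived from it.

```latex
\frac{P(M \mid f_1, \dots, f_n)}{P(\bar{M} \mid f_1, \dots, f_n)}
  = \frac{P(M)}{P(\bar{M})} \prod_{i=1}^{n} \frac{P(f_i \mid M)}{P(f_i \mid \bar{M})}
```

Taking logarithms turns the product into a sum, so each feature contributes one additive term to the score of a candidate pair.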

15 Classification: learning. Iterative inference: terminate if S_n is discriminative. To decide whether S_n is discriminative, set a threshold for each feature by choosing an error rate on the training set. Order of the features.
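The idea of stopping early once the score is discriminative can be sketched as a loop that acquires features lazily, in a fixed order, and quits as soon as the running score leaves an undecided band. The sketch assumes an additive score and a single pair of thresholds (`lower`, `upper`) for brevity, whereas the slide describes thresholds chosen per feature from training error rates; the function and parameter names are hypothetical.

```python
def acquire_until_discriminative(initial_score, features, lower, upper):
    """Add feature evidence one at a time and stop once the score is decisive.

    features: ordered list of (name, acquire, contribution) where
      acquire()           fetches the raw value (possibly expensive),
      contribution(value) maps it to an additive score term.
    Returns the final score and the names of the features actually acquired.
    """
    score = initial_score
    acquired = []
    for name, acquire, contribution in features:
        # Discriminative already: skip every remaining acquisition.
        if score <= lower or score >= upper:
            break
        value = acquire()
        score += contribution(value)
        acquired.append(name)
    return score, acquired
```

Placing cheap local features first means that costly external ones, such as geocoding or avatar comparison, are only fetched for pairs that remain undecided.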

16 Classification: learning. Initial value: estimated from the prior that two profiles sharing rarer tokens are more likely to be matched.
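The slide's formula for the initial value is not preserved. One simple way to express "sharing rarer tokens is stronger evidence" is to sum the inverse document frequency of the tokens two profiles have in common, as in the sketch below. This IDF-sum scoring and the function names are assumptions for illustration only and should not be read as the authors' estimator.

```python
import math
from collections import Counter


def token_idf(profiles):
    """IDF of each token over a list of token lists (one list per profile)."""
    n = len(profiles)
    df = Counter(token for tokens in profiles for token in set(tokens))
    return {token: math.log(n / count) for token, count in df.items()}


def rarity_score(tokens_a, tokens_b, idf):
    """Sum of IDF over shared tokens: rarer shared tokens score higher."""
    shared = set(tokens_a) & set(tokens_b)
    return sum(idf.get(token, 0.0) for token in shared)
```

With such a score, two profiles sharing "haochen" would start well ahead of two profiles sharing only "john".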

17 Dataset of experiment. Data source: 152,294 Twitter users; 154,379 LinkedIn users. Ground truth: 9,750 identities, of which 4,779 have both accounts, 3,339 have only a Twitter account, and 1,632 have only a LinkedIn account.

18 Experiment: Performance on overall linkage. I-Acc (Identity Accuracy) = correctly identified identities / all identities in ground truth. Better than the naïve learning method, owing to the adopted prior. Performance differs across learning methods.
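The I-Acc ratio can be computed directly once "correctly identified" is pinned down. The sketch below assumes that an identity with both accounts is correct only when exactly that pair is linked, and that an identity with a single account is correct only when that account is left unlinked; the slide gives only the ratio, so this reading and the function name are assumptions.

```python
def identity_accuracy(identities, predicted_links):
    """I-Acc = correctly identified identities / all ground-truth identities.

    identities: list of (twitter_id or None, linkedin_id or None).
    predicted_links: iterable of (twitter_id, linkedin_id) pairs.
    """
    links_by_twitter, links_by_linkedin = {}, {}
    for tw, li in predicted_links:
        links_by_twitter.setdefault(tw, set()).add(li)
        links_by_linkedin.setdefault(li, set()).add(tw)

    correct = 0
    for tw, li in identities:
        if tw is not None and li is not None:
            # Both accounts: exactly this pair must be linked, nothing else.
            ok = (links_by_twitter.get(tw) == {li}
                  and links_by_linkedin.get(li) == {tw})
        elif tw is not None:
            ok = tw not in links_by_twitter   # single account stays unlinked
        else:
            ok = li not in links_by_linkedin
        correct += ok
    return correct / len(identities)
```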

19 Experiment: Cost-sensitive feature acquisition. 5% improvement in F1 from 148,743 external feature acquisitions. Different orders of external features: rank by cost; rank by distinguishability. Three sections divided by two inflection points.

20 Discussion: dataset construction. Connections: cannot correctly reflect the web-scale setting; name is too significant. People search: difficult to construct the ground truth. Solution?

21 Discussion: people search task. Query in LinkedIn by Twitter user’s name. Average 10 results for each query.
Method      Pre    Rec    F1
Human       0.643  0.900  0.750
NB_Local    0.369  0.441  0.402
NB_All      0.418  0.493  0.453
C4.5_Local  0.594  0.240  0.342
C4.5_All    0.609  0.380  0.468
CSPL_Local  0.543  0.658  0.595
CSPL_All    0.578  0.713  0.638
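The F1 column is the harmonic mean of precision and recall, which makes the rows easy to sanity-check. The short sketch below recomputes F1 for each method from the Pre and Rec values reported on this slide; because those inputs are themselves rounded to three decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed on the slide.
REPORTED = {
    "Human": (0.643, 0.900, 0.750),
    "NB_Local": (0.369, 0.441, 0.402),
    "NB_All": (0.418, 0.493, 0.453),
    "C4.5_Local": (0.594, 0.240, 0.342),
    "C4.5_All": (0.609, 0.380, 0.468),
    "CSPL_Local": (0.543, 0.658, 0.595),
    "CSPL_All": (0.578, 0.713, 0.638),
}

if __name__ == "__main__":
    for method, (pre, rec, f1) in REPORTED.items():
        print(f"{method:<11} reported={f1:.3f} recomputed={f1_score(pre, rec):.3f}")
```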

22 Discussion: feature dependency. Features are compared independently. Example: 2 people at Tsinghua with the same name Li Peng; 2 people at NUS with the same name Li Peng. Construct a different IDF table for names in each locale: not general; not significantly effective.

23 Conclusion. We proposed a supervised probabilistic approach that solves the identity linkage problem effectively. The prior that users sharing rarer tokens are more likely to match improves the performance of the approach. Iterative inference reduces unnecessary feature acquisitions.

24 Thank you

