Large-Scale Entity-Based Online Social Network Profile Linkage.

1 Large-Scale Entity-Based Online Social Network Profile Linkage

2 Background & Motivation. Footprints in different social networks. User identification in social analysis. Privacy & security. Commercial & government applications.

3 Outline. Problem definition. Related work. Approach. Experiment. Conclusion & future work.

4 Problem Definition. Terminology. Identity: a person. Profile/User: your footprint on social media. Profile Linkage: linking your footprints together. Input & Output. Input: profiles of one site as QUERY and profiles of the other site as TARGET. Output: all pairs of profiles classified as matched.

5 Characteristics of profile. Name (semi vs. structured): structured {“given name”: “haochen”, “family name”: “zhang”}; semi-structured name: zhang haochen. Semi-structured schema. Incompleteness & missing attributes: privacy policy, virtual identification. Free-text description: Bio, About me, Tags. Multilingualism.
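The two name layouts on this slide can be brought to a common form before comparison. The following is a minimal sketch, assuming profiles arrive as plain dictionaries; the function name `name_tokens` and the field handling are illustrative, not taken from the original system.

```python
import unicodedata

def name_tokens(profile):
    """Return an order-insensitive set of lowercase name tokens.

    Assumption: a profile is a dict with either a single free-form
    "name" field or separate "given name" / "family name" fields.
    """
    if "name" in profile:
        raw = profile["name"]                      # semi-structured form
    else:
        parts = (profile.get(k, "") for k in ("given name", "family name"))
        raw = " ".join(parts)                      # structured form
    raw = unicodedata.normalize("NFKC", raw).lower()
    return frozenset(raw.split())

structured = {"given name": "haochen", "family name": "zhang"}
semi = {"name": "zhang haochen"}
same = name_tokens(structured) == name_tokens(semi)  # token order is ignored
```

Treating the name as a token set sidesteps the given-name/family-name ordering problem, at the cost of discarding the field labels that a structured profile provides.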

6 Top 5 languages in the Facebook dataset: English, Portuguese, Spanish, Chinese, French. Most frequent tokens in different languages: chris, john, michael; chen, wang, lee; carlos, garcia, daniel; sergey, olga, alexander. About 70% of users are in English. 7.2% of users register under different locales. Transliteration: 昊辰 => Haochen.

7 Related work. String similarity metrics: name matching (Jaro-Winkler, edit distance), VSM (TF-IDF). Pair-wise comparison: schema matching for different prototypes, unsupervised vs. supervised. Indexing techniques: blocking, canopy.
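As a reminder of what the edit-distance metric listed above computes, here is a small sketch of Levenshtein distance with a length-normalized similarity. The normalization by the longer string is one common convention and an assumption here, not something the slide specifies.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion).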

8 Overview of approach. Classification of Potential Links: features representation, supervised learning. Pruning with Canopy: parameter tuning, canopy construction. Entity-based Representation of Profiles: mapping, tokenization, entity extraction.

9 Entity-based representation. Extraction (attribute => entity): named-entity recognition, language detector, regular expressions. Involved entities: username, name, location, organization, URL, language, country, gender, birth. Tokenization: general representation, Microsoft Word Breaker, preparation for canopy.
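The attribute-to-entity step can be pictured with a small regular-expression sketch. Everything below is an assumption made for illustration: the profile is a dict of strings, the tokenizer is a plain `\w+` pattern standing in for a real word breaker, and `CC_TLD` is a deliberately tiny, hypothetical lookup from country-code top-level domains to countries.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s,;]+", re.IGNORECASE)
TOKEN_RE = re.compile(r"\w+")

# Hypothetical lookup table; a real system would need a full ccTLD list.
CC_TLD = {"cn": "China", "fr": "France", "br": "Brazil", "es": "Spain"}

def extract_entities(profile):
    """Map free-form attribute values to URL, country and token entities."""
    entities = {"url": [], "country": [], "token": []}
    for value in profile.values():
        for url in URL_RE.findall(value):
            host = urlparse(url).hostname or ""
            entities["url"].append(host)
            tld = host.rsplit(".", 1)[-1]
            if tld in CC_TLD:                  # weak hint of the user's country
                entities["country"].append(CC_TLD[tld])
        text = URL_RE.sub(" ", value)          # tokenize the non-URL text only
        entities["token"].extend(t.lower() for t in TOKEN_RE.findall(text))
    return entities

example = extract_entities(
    {"name": "Zhang Haochen", "bio": "Student. Blog: http://example.cn/me"}
)
```

A country inferred this way is only a hint, since a user can host a page under any domain.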

10 Canopy: design
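As a rough illustration of the general canopy technique (not necessarily the design used in this work), the sketch below groups profiles with a cheap Jaccard similarity over token sets and two thresholds, then keeps only QUERY/TARGET pairs that share a canopy. The threshold names `loose` and `tight`, their values, and the choice of Jaccard are assumptions, and they are not meant to correspond to the theta parameter reported later.

```python
def jaccard(a, b):
    """Cheap set similarity used only for canopy construction."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def build_canopies(token_sets, loose=0.2, tight=0.6):
    """token_sets: dict of profile id -> set of tokens.

    Returns a list of (center, members). A profile joins every canopy
    whose center it resembles at least `loose`; profiles resembling a
    center at least `tight` stop being candidate centers themselves.
    """
    assert loose <= tight
    remaining = list(token_sets)               # candidate centers
    canopies = []
    while remaining:
        center = remaining[0]                  # deterministic pick
        c_tokens = token_sets[center]
        members = {pid for pid, toks in token_sets.items()
                   if jaccard(c_tokens, toks) >= loose}
        canopies.append((center, members))
        remaining = [pid for pid in remaining
                     if jaccard(c_tokens, token_sets[pid]) < tight]
    return canopies

def candidate_pairs(canopies, query_ids, target_ids):
    """Keep only (query, target) pairs that co-occur in some canopy."""
    pairs = set()
    for _, members in canopies:
        for q in members & query_ids:
            for t in members & target_ids:
                pairs.add((q, t))
    return pairs

profiles = {
    "q1": {"haochen", "zhang"},
    "q2": {"john", "smith"},
    "t1": {"zhang", "haochen", "beijing"},
    "t2": {"john", "smyth"},
    "t3": {"olga"},
}
pairs = candidate_pairs(build_canopies(profiles), {"q1", "q2"}, {"t1", "t2", "t3"})
```

In this toy input, only two of the six possible QUERY/TARGET pairs survive pruning and would be passed on to the classifier.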

11 Canopy: efficiency

12 Classification: features. String similarity (Jaro-Winkler similarity): username, name. Token similarity (n-gram, IDF): username, name, location, URL, organization, all tokens. Enumeration identity: language, country.
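The three feature families above can be sketched as follows. The Jaro-Winkler code follows the standard definition (prefix scale 0.1, prefix capped at 4 characters); the IDF-weighted overlap and the `pair_features` layout, including its field names, are assumptions chosen for illustration rather than the exact features used in this work.

```python
def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [c for c, hit in zip(s, s_hit) if hit]
    t_m = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def idf_overlap(a, b, idf):
    """IDF-weighted Jaccard: rare shared tokens count more."""
    shared = sum(idf.get(tok, 0.0) for tok in a & b)
    total = sum(idf.get(tok, 0.0) for tok in a | b)
    return shared / total if total else 0.0

def pair_features(q, t, idf):
    """Feature vector for one QUERY/TARGET pair (hypothetical fields)."""
    same_lang = q.get("language") is not None and q.get("language") == t.get("language")
    return [
        jaro_winkler(q["username"], t["username"]),
        jaro_winkler(q["name"], t["name"]),
        idf_overlap(set(q["name"].split()), set(t["name"].split()), idf),
        1.0 if same_lang else 0.0,             # enumeration identity
    ]
```

The textbook pair "MARTHA" / "MARHTA" gives a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961, which is a convenient sanity check for any implementation.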

13 Classification: learning. Supervised learning: SVM, Naïve Bayes, C4.5, AdaBoost with C4.5. Problem: imbalance between positive instances and negative instances.

14 Dataset of experiment. Data sources: Google+, Twitter, Facebook. 20000+ profiles for each social network and 10000+ matched pairs.

15 Experiment on artificial dataset. Balanced dataset with an equal number of positive instances and randomly selected negative instances, so matched links and unmatched links are quite different from each other. Name features are the most important features, although they largely mirror each other. Country extracted from URL may be deceptive.
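A balanced set of this kind can be built by keeping every positive pair and drawing an equal number of negatives at random. The sketch below assumes labels are 1 for matched and 0 for unmatched pairs; the function name, the `ratio` parameter and the fixed seed are illustrative choices.

```python
import random

def balance_by_undersampling(pairs, labels, ratio=1.0, seed=0):
    """Keep all positives and a random sample of negatives.

    ratio: number of negatives to keep per positive (1.0 = balanced).
    """
    rng = random.Random(seed)                  # fixed seed for repeatability
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [pairs[i] for i in keep], [labels[i] for i in keep]

pairs = [("q%d" % i, "t%d" % i) for i in range(10)]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
bal_pairs, bal_labels = balance_by_undersampling(pairs, labels)
```

Randomly drawn negatives tend to be easy to separate from positives, which is consistent with the slide's remark that the two classes look quite different in this setting.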

16 Experiment on overall dataset. The dataset is imbalanced: the ratio of positive to negative instances can be 1:100 even after pruning with canopy. Candidate pairs that are more similar to each other hurt performance. Failed pruning and excessive reliance on name features hurt recall.
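With a 1:100 class ratio, accuracy alone says little, which is why precision, recall and F1 are the more informative measures here. The sketch below computes all four from confusion counts; the counts in the example are hypothetical and chosen only to mirror the 1:100 ratio.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical: 100 matched and 10000 unmatched pairs, and a classifier
# that labels every pair as unmatched.
p, r, f1, acc = evaluate(tp=0, fp=0, fn=100, tn=10000)
```

The all-negative classifier reaches an accuracy of about 0.99 while its recall and F1 are both 0, so it finds no links at all.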

17 Parameter tuning. A greater threshold brings in more candidates, which interfere with the classifier. A smaller threshold prunes more matched links by mistake.

18 Efficiency

19 Conclusion. We investigated the characteristics of social network user profiles. We proposed a supervised approach with canopy to solve the large-scale profile linkage task. The approach is shown to be both effective and efficient, while the run time and complexity can be controlled.

20 Future work. Investigate the characteristics of profiles in different locales more deeply. Improve the learning techniques. Automatic semi-structured profile comparison or schema mapping. Improve the approach for the web people search task.

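21 Theta vs. Corpus size. Columns: |Q|+|T|; theta (G+ vs. F); theta (G+ vs. T); theta (T vs. F). Values per corpus size, in slide order (empty cells are not marked in the source): 15000: 50. 20000: 300, 100, 30. 25000: 200, 50, 200. 30000: 200, 100, 200. 35000: 300, 100. 40000: 100, 200. 45000: 100. 50000: 200.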

22 Web People Search. Search via a search engine, querying by username or tokens. 3 * 2 * 8 * 2 = 96 queries, 675 candidates, 63 ground truth. Evaluation (classifier trained with the overall training set): SVM: P=0.30, R=0.77, F1=0.43 and Accuracy=0.81; AdaBoosted C4.5: P=0.11, R=0.96, F1=0.21 and Accuracy=0.31. Usernames or names that are too similar make it hard to classify unmatched instances correctly.

23 Thank you

