Exploring Linkability of User Reviews Mishari Almishari and Gene Tsudik Computer Science Department University of California, Irvine


Increasing Popularity of Reviewing Sites: Yelp had more than 39M visitors and 15M reviews in 2010

Review fields: Category, Rating

Rising Awareness of Privacy

How Does Privacy Apply to Reviews? Traceability; linkability of ad hoc reviews; linkability of several accounts

Contribution: an extensive study measuring privacy/linkability in user reviews; proposed models that adequately identify authors

Settings & Problem Formulation

IR: Identified Record; AR: Anonymous Record

Parameters: Anonymous Record (AR) size: 1, 5, 10, 20, …, 60; Identified Record (IR) size; matching model; Top-X linkability, with X = 1 and 10
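Top-X linkability can be read as the fraction of anonymous records whose true author appears among the first X candidates returned by the matching model. The sketch below illustrates that reading only; the function name and the dictionary layout are assumptions made for this example, not part of the original slides.

```python
def top_x_linkability(ranked_candidates, true_authors, x):
    """Fraction of ARs whose true author is within the first x ranked IRs.

    ranked_candidates: dict mapping an AR id to a list of user ids,
                       best match first (hypothetical layout).
    true_authors:      dict mapping an AR id to the id of its real author.
    """
    if not ranked_candidates:
        return 0.0
    hits = sum(
        1
        for ar_id, ranking in ranked_candidates.items()
        if true_authors[ar_id] in ranking[:x]
    )
    return hits / len(ranked_candidates)


# Two anonymous records, three candidate users.
rankings = {"ar1": ["u1", "u2", "u3"], "ar2": ["u3", "u2", "u1"]}
truth = {"ar1": "u1", "ar2": "u2"}
top1 = top_x_linkability(rankings, truth, 1)
top2 = top_x_linkability(rankings, truth, 2)
```

Here only ar1 is linked at Top-1, while both records are linked once the second candidate is allowed.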

Dataset: 1 million reviews; 2,000 users, each with more than 300 reviews

Methodology: Naïve Bayesian Model; Kullback-Leibler Model (symmetric version)

Methodology: link an Anonymous Record (AR) to an Identified Record (IR). Naïve Bayesian Model (NB): return the IR_i that maximizes P(AR | IR_i). Kullback-Leibler Divergence (KLD): compute Distance(AR, IR_i) and return the IR_i with the minimum distance
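The two matching rules above can be sketched in a few lines. This is a minimal illustration over letter unigrams, not the authors' implementation: the add-one smoothing (used here to avoid zero probabilities) and all function names are assumptions introduced for the example.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def unigram_counts(text):
    """Count the letters a..z in a text, ignoring everything else."""
    return Counter(ch for ch in text.lower() if ch in ALPHABET)


def smoothed_probs(counts, smoothing=1.0):
    """Token probabilities with additive smoothing (an assumption of this sketch)."""
    total = sum(counts.values()) + smoothing * len(ALPHABET)
    return {t: (counts.get(t, 0) + smoothing) / total for t in ALPHABET}


def nb_log_likelihood(ar_counts, ir_probs):
    """log P(AR | IR) under the naive independence assumption."""
    return sum(n * math.log(ir_probs[t]) for t, n in ar_counts.items())


def symmetric_kld(p, q):
    """D(p || q) + D(q || p) for two distributions over the same tokens."""
    return sum(
        p[t] * math.log(p[t] / q[t]) + q[t] * math.log(q[t] / p[t]) for t in p
    )


def rank_nb(ar_text, ir_texts):
    """User ids sorted by decreasing likelihood of the AR."""
    ar_counts = unigram_counts(ar_text)
    scores = {
        uid: nb_log_likelihood(ar_counts, smoothed_probs(unigram_counts(txt)))
        for uid, txt in ir_texts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def rank_kld(ar_text, ir_texts):
    """User ids sorted by increasing symmetric KL distance to the AR."""
    ar_probs = smoothed_probs(unigram_counts(ar_text))
    scores = {
        uid: symmetric_kld(ar_probs, smoothed_probs(unigram_counts(txt)))
        for uid, txt in ir_texts.items()
    }
    return sorted(scores, key=scores.get)


# Toy identified records for two hypothetical users.
identified = {"alice": "aaaaabbbbb" * 5, "bob": "zzzzzyyyyy" * 5}
nb_ranking = rank_nb("aabba", identified)
kld_ranking = rank_kld("aabba", identified)
```

Both rankings place "alice" first for this toy input; the first element of each list is the Top-1 match, and the first ten elements would give the Top-10 candidates.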

Naïve Bayesian (NB): Identified Records (IRs) and an Anonymous Record (AR) go in; a list of IRs sorted in decreasing order comes out

Kullback-Leibler Divergence (KLD): Identified Records (IRs) and an Anonymous Record (AR) go in; a list of IRs sorted in increasing order comes out

Maximum Likelihood Estimation

Tokens: Unigram: a, …, z; Digram: aa, ab, …, zz; Rating: 1, 2, 3, 4, 5; Category: Restaurant, Beauty and Spa, Education
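As an illustration of the token sets and of maximum likelihood estimation, the sketch below extracts letter unigrams and digrams and estimates each token's probability as its relative frequency. Taking digrams only within runs of letters (never across spaces or punctuation) is an assumption of this example, as are the function names.

```python
import re
import string
from collections import Counter
from itertools import product

LETTERS = string.ascii_lowercase
UNIGRAMS = list(LETTERS)                                      # 26 tokens
DIGRAMS = ["".join(p) for p in product(LETTERS, repeat=2)]   # 26 * 26 = 676 tokens


def letter_ngrams(text, n):
    """Letter n-grams taken inside each maximal run of letters."""
    tokens = []
    for run in re.findall(r"[a-z]+", text.lower()):
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def mle(tokens):
    """Maximum likelihood estimate: count of a token divided by the total count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


digram_tokens = letter_ngrams("Good food!", 2)
digram_probs = mle(digram_tokens)
rating_probs = mle([5, 4, 5, 5])  # the same estimator works for ratings or categories
```

The same `mle` helper applies to ratings and categories, since it only needs a list of observed tokens.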

Lexical Token Results

NB, Unigram: Size 60: LR 83% (Top-1), LR 96% (Top-10)

KLD, Unigram: Size 60: LR 83% (Top-1), LR 96% (Top-10)

NB, Digram: Size 20: LR 97% (Top-1); Size 10: LR 88% (Top-1)

KLD, Digram: Size 60: LR 99% (Top-1); Size 30: LR 75% (Top-1)

Improvement (1): Combining Lexical and Non-Lexical Tokens

Combining in the NB model is straightforward: use P(Rating | IR) and P(Category | IR). But for KLD? Use a weighted average

First, combine rating and category. Second, combine non-lexical and lexical tokens, with weights of 0.997/0.97 for Unigram/Digram

Rating and Category: beta value of 0.5

Non-Lexical and Unigram: alpha value of 0.997

Non-Lexical and Digram: alpha value of 0.97
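The two-step weighted average can be sketched as follows. The slides give only the weights (beta of 0.5, alpha of 0.997 for unigrams and 0.97 for digrams); which term each weight multiplies is an assumption of this sketch, as are the function names.

```python
ALPHA_UNIGRAM = 0.997  # from the slides
ALPHA_DIGRAM = 0.97    # from the slides
BETA = 0.5             # from the slides


def combine_nonlexical(rating_dist, category_dist, beta=BETA):
    """Step 1: average the rating and category KLD distances.

    Assumption: beta weights the rating distance, (1 - beta) the category one.
    """
    return beta * rating_dist + (1.0 - beta) * category_dist


def combine_kld(lexical_dist, rating_dist, category_dist, alpha, beta=BETA):
    """Step 2: average the lexical distance with the non-lexical one.

    Assumption: alpha weights the lexical (unigram or digram) distance.
    """
    nonlexical = combine_nonlexical(rating_dist, category_dist, beta)
    return alpha * lexical_dist + (1.0 - alpha) * nonlexical


# Toy distances between one AR and one IR.
combined = combine_kld(1.0, 2.0, 4.0, alpha=ALPHA_UNIGRAM)
```

The IR with the smallest combined distance would then be returned, exactly as in the lexical-only KLD model.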

Token Combining Results

Rating, Category, and Unigram, NB: gain of up to 20%; Size 30: 60% to 80%; Size 60: 83% to 96%

Rating, Category, and Unigram, KLD: gain of up to 12%; Size 40: 68% to 80%; Size 60: 83% to 92%

Rating, Category, and Digram - NB

Rating, Category, and Digram - KLD

What about Restricting Identified Record (IR) Size?

Parameters: Anonymous Record (AR) size; Identified Record (IR) size; matching model; Top-X linkability, with X = 1 and 10

Restricted IR, NB: affected by IR size

Restricted IR, KLD: performed better for smaller IRs; Size 20 or less: improved; the rest: comparable

What about Matching All ARs at once?

Parameters: Anonymous Record (AR) size; Identified Record (IR) size; matching model; Top-X linkability, with X = 1 and 10

Anonymous Records (ARs) Identified Records (IRs) Matching Model

Improvement (2): Matching All ARs at Once
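The slides do not spell out how the joint matching is done. One plausible reading, shown here purely as a hypothetical sketch, is a greedy one-to-one assignment: repeatedly take the closest remaining (AR, IR) pair and remove both from further consideration, so that no two anonymous records are linked to the same user. The function name and input layout are assumptions.

```python
def match_all(distances):
    """Greedy one-to-one matching of ARs to IRs (illustrative assumption).

    distances: dict mapping (ar_id, ir_id) to a distance; smaller is better.
    Returns a dict mapping each matched ar_id to one ir_id.
    """
    assignment = {}
    used_irs = set()
    # Visit candidate pairs from the smallest distance to the largest.
    for (ar_id, ir_id), _ in sorted(distances.items(), key=lambda kv: kv[1]):
        if ar_id in assignment or ir_id in used_irs:
            continue  # this AR or this IR has already been matched
        assignment[ar_id] = ir_id
        used_irs.add(ir_id)
    return assignment


# Matched independently, both ARs would pick u1; jointly, a2 falls back to u2.
toy_distances = {
    ("a1", "u1"): 0.1,
    ("a1", "u2"): 0.5,
    ("a2", "u1"): 0.2,
    ("a2", "u2"): 0.9,
}
joint = match_all(toy_distances)
```

The toy input shows the intended effect: an IR that has already been claimed by a closer AR cannot be chosen again.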

MatchAll, Restricted: gain of up to 16%; Size 30: from 74% to 90%

MatchAll, Full: gain of up to 23%; Size 20: from 35% to 55%

Improvement (3): For Small IR Size

Changing it to: Review Length

Results, Improvement (3): Size 10: 89% to 92%; Size 7: 79% to 84%; gain of up to 5%

Discussion. Implications: cross-referencing; review spam. Non-prolific users: they gradually become prolific; with an IR of 20, linkability is around 70%. Anonymous record size: linkability is high even for small ARs (92% for an AR of 10); 60 is only 20% of the minimum user contribution

Discussion (cont.). Unigram tokens: very comparable for larger ARs; entail fewer resources in the attack (26 vs. 676 tokens)

Future Directions: improving further for small ARs; other probabilistic models; using stylometry; exploring linkability in other preference databases; more than one AR for different users (exploring it further)

Conclusion: an extensive study to assess the linkability of user reviews, for a large set of users, using very simple features. Users are very exposed, even with simple features and a large number of authors

Thank you all!