
1 How we all fit together ©2014 Inome, Inc. All Rights Reserved. Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution Xin Wang, Ang Sun, Hakan Kardes, Siddharth Agrawal, Lin Chen, Andrew Borthwick

2 Our Mission: Gather 20 billion raw records about people: publicly available, white page (phone records, credit card headers), property record, court record (criminal, civil, marriage/divorce), social media, news, professional. Conflate all the records about the same person together. Create a graph of 250 million profiles: one profile for everybody in the US.


5 Our Approach: Formulate the problem as a graph partition task. 7 billion nodes (each record is a node). Weights on edges are similarity scores from machine-learning-based models. Cluster the graph into 313.9 million clusters. The Challenges: Most graph partition algorithms can't scale to this size. Dynamic Blocking: iteratively divide the graph into smaller subgraphs. Limited resources: an 88-node Hadoop cluster for multiple monthly builds. The number of clusters in a subgraph is unknown. People records are ambiguous by nature.
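
The idea of iteratively dividing the graph into smaller subgraphs can be illustrated with a toy sketch. This is not Inome's implementation: the record fields, the ordered `keys` list, and the `max_block_size` threshold are all hypothetical, and the sketch simply adds one more blocking key whenever a block is still too large to compare pairwise.

```python
from collections import defaultdict


def dynamic_blocks(records, keys, max_block_size):
    """Split records into blocks, refining oversized blocks with one more key.

    records: list of dicts (hypothetical schema)
    keys: ordered list of field names to block on (hypothetical)
    max_block_size: largest block accepted for pairwise comparison
    """
    def split(block, depth):
        # Accept the block if it is small enough or no key fields remain.
        if len(block) <= max_block_size or depth == len(keys):
            return [block]
        # Otherwise group the block by the next key and recurse on each group.
        groups = defaultdict(list)
        for rec in block:
            groups[rec.get(keys[depth])].append(rec)
        result = []
        for sub_block in groups.values():
            result.extend(split(sub_block, depth + 1))
        return result

    return split(list(records), 0)


records = [
    {"last": "Johnson", "state": "NY", "city": "New York"},
    {"last": "Johnson", "state": "NY", "city": "New York"},
    {"last": "Johnson", "state": "WA", "city": "Index"},
    {"last": "Johnson", "state": "WA", "city": "Bellevue"},
    {"last": "Gates", "state": "WA", "city": "Seattle"},
]
blocks = dynamic_blocks(records, ["last", "state", "city"], max_block_size=2)
```

Each resulting block would then be scored and clustered on its own, which is what keeps the pairwise comparison step tractable on limited hardware.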

6 Records: Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY. Two records with a common name in a big city have a low probability of being about the same person.

7 Records: Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY; Patricia Johnson | Index Elementary School, Index, WA; Patricia Johnson | 402 5th St, Index, WA. Two records with a common name in a small town are more likely to be about the same person.

8 Records: Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY; Patricia Johnson | 402 5th St, Index, WA; Patricia Johnson | Index Elementary School, Index, WA; Patricia Johnson | 227 56th St, New York, NY 10022 | 1502 SE 5th St, Bellevue, WA | 312 Main St, Oberlin, OH | DOB: 1974 | Worked: Inome, Inc; Patricia Johnson | Worked: Morgan Stanley, NY | BA, Oberlin College, 96. Combining evidence from multiple locations increases the match likelihood.

9 Records: Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY; Patricia Johnson | 402 5th St, Index, WA; Patricia Johnson | Index Elementary School, Index, WA; Patricia Johnson | 227 56th St, New York, NY 10022 | 1502 SE 5th St, Bellevue, WA | 312 Main St, Oberlin, OH | DOB: 1974 | Worked: Inome, Inc; Patricia Johnson | Worked: Morgan Stanley, NY | BA, Oberlin College, 96; Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY | DOB: 05/21/1974. Incorporating other demographic information also helps with matching two records.

10 Approximate Match Likelihood of Two Records with Demographics. Demographic information we can use: name frequency; population of the US; population of a shared location (can be a city, zip code, county, MSA, state, or distance based). Records: Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY; Patricia Johnson | Index Elementary School, Index, WA; Patricia Johnson | 402 5th St, Index, WA.
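
How name frequency and the population of a shared location might combine can be sketched as follows. The transcript does not preserve the formula from the slides, so this is an assumed back-of-the-envelope model: the two records are taken to be equally likely to belong to the true owner or to any other resident with the same name. The function name and the numeric inputs are illustrative placeholders, not measured statistics.

```python
def match_likelihood(name_freq, location_population):
    """Rough chance that two same-name records in one location are one person.

    name_freq: assumed fraction of people carrying the name (hypothetical input)
    location_population: number of people in the shared location
    """
    # Expected number of *other* residents who carry the same name.
    expected_others = name_freq * location_population
    # The true owner plus the expected others are treated as equally plausible.
    return 1.0 / (1.0 + expected_others)


# Illustrative placeholder values only.
big_city = match_likelihood(name_freq=1e-4, location_population=8_000_000)
small_town = match_likelihood(name_freq=1e-4, location_population=200)
```

Under these assumptions the big-city pair scores far lower than the small-town pair, which matches the intuition in the Patricia Johnson examples.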


12 Approximate Match Likelihood of Two Records with Demographics. Demographic information we can use: name frequency; population of the US; population of a shared location (can be a city, zip code, county, MSA, state, or distance based); birthday/age information. Records: Patricia Johnson | 227 56th St, New York, NY 10022; Patricia Johnson | Worked: Morgan Stanley, NY | DOB: 05/21/1974.

13 Approximate Match Likelihood of Two Records with Demographics. Records: Patricia Johnson | 227 56th St, New York, NY 10022 | 1502 SE 5th St, Bellevue, WA | 312 Main St, Oberlin, OH | DOB: 1974 | Worked: Microsoft, Redmond, WA; Patricia Johnson | Worked: Morgan Stanley, NY | BA, Oberlin College, 96 | 2 Beechwood Way, Scarborough, NY | Worked: IBM, Armonk, NY.


15 Approximate Match Likelihood of Two Records with Demographics. Multiple regions and multiple location matches in each region: name frequency of a region; population of a region (state, MSA); population of a shared location (can be a city, zip code, county, MSA, state, or distance based); birthday/age information.
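
One way to see why several shared locations and a birth date raise the match likelihood is the sketch below. It is an assumption, not the formula from the slides: it treats residence in each location as independent, uses a single national name frequency rather than regional ones, and takes `dob_fraction` (the share of people with a compatible birth date) as a hypothetical input.

```python
def combined_match_likelihood(name_freq, us_population, shared_location_pops,
                              dob_fraction=1.0):
    """Rough match likelihood from name, shared locations, and birth date.

    name_freq: assumed national fraction of people with the name
    us_population: total population used as the base
    shared_location_pops: populations of each location both records share
    dob_fraction: assumed fraction of people with a compatible birth date
    """
    # Start from everyone in the country with this name and a compatible DOB.
    expected_others = name_freq * us_population * dob_fraction
    # Each shared location keeps only the fraction of people living there
    # (independence across locations is a simplifying assumption).
    for pop in shared_location_pops:
        expected_others *= pop / us_population
    return 1.0 / (1.0 + expected_others)


# Illustrative placeholder values only.
one_city = combined_match_likelihood(1e-4, 320_000_000, [8_000_000])
two_places = combined_match_likelihood(1e-4, 320_000_000, [8_000_000, 8_000])
with_dob = combined_match_likelihood(1e-4, 320_000_000, [8_000_000],
                                     dob_fraction=1 / 80)
```

Each additional shared location or demographic constraint shrinks the pool of other plausible people, so the score moves toward 1.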

16 Approximate Match Likelihood of Two Records with Demographics

17 How do we get the demographic statistics? 1. Population: US population; state, MSA; county, city, zip code. 2. Name frequencies: US, state, MSA; different combinations of name components.

18 Data Sources and Their Record Counts

19 Gaussian Truth Model for Estimating Name Frequencies. Diagram labels: Data Source; Name Count; Name Observations; Source Variance; Source Priors; True Count; Name Priors.

20 Implementation of GTM for Name Frequency Truth Estimation. Diagram labels, in order: Source 1, Source 2, Source 3, Source N; Extract Source Name Frequency; Name Freq Table 1, Name Freq Table 2, Name Freq Table 3, Name Freq Table N; Normalize Name Frequency; Normalized Table 1, Normalized Table 2, Normalized Table 3, Normalized Table N; EM Algorithm to Extract Source Bias and Compute the True Name Frequency; Normalized Estimates; Denormalize Name Frequency; Evaluators; True Estimates; Best Sources Config; Name Freq Mean and Standard Error Table.
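
The estimation step can be illustrated with a simplified alternating-update scheme in the spirit of EM. This is a sketch under stated assumptions, not the full Gaussian Truth Model or Inome's code: observations are assumed to be already normalized, each source gets one Gaussian variance, name priors are omitted, and the source prior is reduced to the hypothetical `prior_variance` and `prior_strength` parameters.

```python
def estimate_truth(observations, iterations=20,
                   prior_variance=1.0, prior_strength=1.0):
    """Estimate a true value per name and a noise variance per source.

    observations: dict mapping source -> dict mapping name -> normalized value
    Returns (truth, variance) dictionaries.
    """
    sources = list(observations)
    names = sorted({n for obs in observations.values() for n in obs})
    variance = {s: prior_variance for s in sources}
    truth = {}
    for _ in range(iterations):
        # Truth update: precision-weighted mean of the sources that saw the name.
        for name in names:
            num = den = 0.0
            for s in sources:
                if name in observations[s]:
                    weight = 1.0 / variance[s]
                    num += weight * observations[s][name]
                    den += weight
            truth[name] = num / den
        # Variance update: mean squared error against the current truth,
        # smoothed toward prior_variance so no source collapses to zero.
        for s in sources:
            sq_err = sum((v - truth[n]) ** 2 for n, v in observations[s].items())
            count = len(observations[s])
            variance[s] = (prior_strength * prior_variance + sq_err) / (
                prior_strength + count)
    return truth, variance


# Hypothetical normalized observations: sources "a" and "b" agree, "c" is noisy.
obs = {
    "a": {"x": 1.0, "y": 2.0, "z": 3.0},
    "b": {"x": 1.1, "y": 1.9, "z": 3.1},
    "c": {"x": 3.0, "y": 0.0, "z": 6.0},
}
truth, variance = estimate_truth(obs)
```

In this toy run the noisy source ends up with the largest variance, so the estimates sit close to the two sources that agree rather than at the plain average.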

21 Experimental Results: Contribution of the Demographic Based Likelihood Feature; Name Frequency Estimates (First Last).

22 Q & A

23 Diagram labels: William H Gates II; William H Gates; William H Gates III; Bill Gates; 123 Main St, Seattle, WA; 235 NE 14 St, Seattle, WA; Bill and Melinda Gates Foundation; William H Gates; 621 Main St, Bellevue, WA; 1925; 1955; JD, UW; Harvard.

24 Name Address History Phone Age/DOB SSN Raw Public Record Work History Education History Text Extracts Email Websites Raw Social Name Features Location Features Phone Features DOB Features SSN Features Demographic Features Relative Based Features ‘Household’ based features Company-wide features Graph Based Features Email Fields Domain/URL Fields Education Fields Work History Fields Text Based Features Whole Field Match Other Fields N-Gram Features Gender-based Neighborhood features Multi Feature Combination (Sum/Max) Combo Features Likelihood Combo Name Birthday Population (NBP) Score Regional NBP Score LinkedIn NBP Score RelatedByPhone Likelihood RelatedByAddr Likelihood Regional Population CombinedNameFreq NotSameNameAndNotSimilar Propositional Logic Combo AND, OR, NOT Propositional Logic Combo AND, OR, NOT if_[!title_whole_field_weighted]_then_[title_correlation] ExactFLAndDifferentMiddleFemale Multi-Field N-Gram Features EmployerTagHybrid SchoolTagHybrid BlurbTagHybrid BlurbAnchorHybrid JobtitleEmployerMultiFieldHybrid Company Location Name Age Inferred Information Keywords Histograms US/Global Geo Dist/Pop Dict US/Global Name Freq Dicts Education Institute Dicts Business loc/Employee Dicts Email/alias Freq Dicts Data Dictionaries (Mined/Purchased) N-Gram Dictionaries Phone Freq Dicts Address Frequency Dicts Ethnicity Biological Information Case Information Criminal Records Biometrics Biometrics Features Offense Base Features

25 Diagram labels: 500 108 Ave NE, Bellevue, WA; Patricia Johnson; Timothy Johnson; Emily Johnson; 227 E 56th St, New York, NY 10022; Stuart Johnson; 227 56th St, New York, NY 10022; 1502 NE 5th St, Bellevue, WA; 1502 SE 5th St, Bellevue, WA; 4253231000; 312 Main St, Oberlin, OH; BA, Oberlin College.

26 Thank You

