David Inouye, Georgia Institute of Technology
2011 DIMACS REU Intern at Rutgers University
William M. Pottenger, Ph.D., Mentor
06/07/2011
* The content of this presentation has been adapted from a presentation given by Nir Grinberg.
Introduction to Entity Resolution
Entity resolution is the problem of deciding if two sets of data elements refer to the same real-world entity.
[Figure: elements from Source 1 and elements from Source 2, with candidate links marked "?"]
Objective/Approach
1. Standardize and Encode
2. Calculate Similarity Scores
3. Classify Using Ground Truth Data
Incidents in GTD*: Month: 6 Day: 28 Year: 2005 City: Dardsun, Kupwara Type: Arson
Incidents in WITS*: Date: 06/27/2005 City: Kupwara Type: Fire attack
* WITS - GTD -
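Phase 1: Standardize and Encode
WITS fields: Incident_ID, Date, City, State_Prov, Country
  Date: /3/06 City: Udhampur State_Prov: Jammu and Kashmir Country: India
  Date: /27/2005 City: Kupwara State_Prov: Jammu and Kashmir Country: India
GTD fields: Eventid, Iyear, Imonth, Iday, City, Provstate, country
  City: Patna Provstate: Bihar country: India
  City: Dardsun Kupwara Provstate: Jammu & Kashmir (State) country: India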
Phase 1: Standardize and Encode
- Standardize dates
- Map WITS weapon types to GTD weapon types
- Geocode location to latitude and longitude
- Extract topic model distribution using LDA
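As an illustration of the date standardization step, here is a minimal sketch that converts both record styles into one comparable date type. The function names and the assumed MM/DD/YYYY layout of the WITS date string are assumptions made for this example, not part of the original pipeline.

```python
from datetime import date, datetime


def gtd_date(iyear: int, imonth: int, iday: int) -> date:
    """Build a date from GTD-style separate year/month/day fields."""
    return date(iyear, imonth, iday)


def wits_date(text: str) -> date:
    """Parse a WITS-style date string (assumed here to be MM/DD/YYYY)."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# The example pair from the Objective/Approach slide.
gtd = gtd_date(2005, 6, 28)
wits = wits_date("06/27/2005")
print(gtd.isoformat(), wits.isoformat())  # 2005-06-28 2005-06-27
```

Once both sources share one date type, the later similarity phase can compare them with simple date arithmetic.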
Phase 1: Latent Dirichlet Allocation
- Generative probabilistic model
- Assumes topics are probability distributions over words
- Assumes documents are probability distributions over topics
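The generative assumptions above can be sketched in a few lines: draw a topic mixture for the document, then draw each word by first picking a topic and then picking a word from that topic. This is a toy illustration only. The two topics, their word probabilities, and the Dirichlet parameter `alpha` are made up for the example, and the sketch shows the generative story rather than the inference procedure used to fit LDA.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story."""
    # Document-level distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from that topic's distribution over words.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = [
    {"killed": 0.5, "attack": 0.3, "bomb": 0.2},
    {"police": 0.6, "officers": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and each document's topic mixture.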
Phase 1: LDA Example
* Example from “Probabilistic Topic Models” by Mark Steyvers.
Phase 1: Latent “Topics” (most probable words in topic)
- killed, kashmir, attack, injured, militants, suspected, blast, kill, bomb
- fired, upon, armed, killed, manipur, civilian, imphal, member, former
- civilian, kashmir, jammu, night, residence, kidnapped, one, village, doda
- police, one, killing, wounding, officers, two, officer, others, injuring
- jammu, kashmir, baramula, security, one, armed, anantnag, hizbul, mujahedin
- assam, explosive, front, improvised, device, liberation, united, ied, ulfa
- widely, two, civilians, national, tripura, kidnapped, three, village, karbi
- causing, injuries, damage, damaging, fire, station, set, detonated, train
- maoist, party, communist, cpi, widely, pradesh, andhra, chhattisgarh, village
- grenade, threw, civilians, wounding, srinagar, vehicle, two, kashmir, jammu
Phase 2: Compute Similarity
- Dates: 05/23/2001 vs. 05/22/2001
- Nominal strings such as country or city: “Jammu” vs. “Jammuu”
- GeoLocation: Lat 32.8/Long 74.7 vs. Lat 32.27/Long 75.6
- Topic distribution
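The slide does not say which similarity measures were used, so the following is a sketch under assumptions: a day-gap score for dates, normalized edit distance for strings, great-circle (haversine) distance for coordinates, and Hellinger distance for topic distributions. All function names and the 30-day window are hypothetical choices for illustration.

```python
import math
from datetime import date


def date_similarity(a: date, b: date, window_days: int = 30) -> float:
    """1.0 for the same day, falling linearly to 0.0 at window_days apart."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / window_days)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(s: str, t: str) -> float:
    """Edit distance scaled to [0, 1], ignoring case."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s.lower(), t.lower()) / longest


def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in kilometers (mean Earth radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def hellinger_distance(p, q) -> float:
    """Distance between two discrete distributions: 0 = identical, 1 = disjoint."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


# The example pairs from the slide.
print(date_similarity(date(2001, 5, 23), date(2001, 5, 22)))
print(string_similarity("Jammu", "Jammuu"))
print(haversine_km(32.8, 74.7, 32.27, 75.6))
```

Each function yields one number per record pair. Together these numbers form the feature vector passed to the classifier in Phase 3.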
Phase 3: Classify as Match/Non-match
Similarity Scores -> Classifier Model Based on Ground Truth* -> Match or Non-match
* The Center for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland provided the human-annotated ground truth data.
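The slide does not name the classifier, so the sketch below uses logistic regression as a stand-in to show the shape of this step: labeled similarity vectors go in, and a match/non-match decision comes out. The training rows are invented for illustration and are not taken from the START ground truth data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, epochs: int = 500, lr: float = 0.5):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x) -> bool:
    """Return True for a predicted match, False for a non-match."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x))) >= 0.5


# Hypothetical rows: [date similarity, string similarity, topic similarity].
features = [
    [0.97, 0.83, 0.90],
    [0.90, 1.00, 0.80],
    [0.10, 0.20, 0.15],
    [0.30, 0.00, 0.25],
]
labels = [1, 1, 0, 0]  # 1 = match, 0 = non-match

weights, bias = train_logistic(features, labels)
print(predict(weights, bias, [0.95, 0.90, 0.85]))  # True
print(predict(weights, bias, [0.05, 0.10, 0.20]))  # False
```

Any supervised classifier could fill this role. The essential point is that the human-annotated pairs supply the labels used for training.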
Phase 3: Classifier Results
Confusion matrix (rows: Class; columns: Classified)
Classified: Non-match, Match
Class Non-match:
Class Match: 116246
My research possibilities
- Clean up the ground truth data
- Improve upon the HO-LDA algorithm
- Consider how to compute different similarity scores