David Inouye, Georgia Institute of Technology. 2011 DIMACS REU Intern at Rutgers University. William M. Pottenger, Ph.D., Mentor.


1 David Inouye, Georgia Institute of Technology. 2011 DIMACS REU Intern at Rutgers University. William M. Pottenger, Ph.D., Mentor. 06/07/2011
* The content of this presentation has been adapted from a presentation given by Nir Grinberg.

2 Introduction to Entity Resolution
Entity resolution is the problem of deciding whether two sets of data elements refer to the same real-world entity.
[Figure: Elements from Source 1 and Elements from Source 2, with three "?" marks]

3 Introduction to Entity Resolution
Entity resolution is the problem of deciding whether two sets of data elements refer to the same real-world entity.
[Figure: Elements from Source 1 and Elements from Source 2]

4 Objective/Approach
Incidents in GTD*: Month: 6, Day: 28, Year: 2005, City: Dardsun, Kupwara, Type: Arson
Incidents in WITS*: Date: 06/27/2005, City: Kupwara, Type: Fire attack
Standardize and Encode
Calculate Similarity Scores
Classify Using Ground Truth Data
* WITS - https://wits.nctc.gov/; GTD -

5 Phase 1: Standardize and Encode
WITS
Incident_ID | Date | City | State_Prov | Country
 | /3/06 | Udhampur | Jammu and Kashmir | India
 | /27/2005 | Kupwara | Jammu and Kashmir | India
GTD
Eventid | Iyear | Imonth | Iday | City | Provstate | country
 | | | | Patna | Bihar | India
 | | | | Dardsun Kupwara | Jammu & Kashmir (State) | India

6 Phase 1: Standardize and Encode
Standardize dates
Map WITS weapon types to GTD weapon types
Geocode location to latitude and longitude
Extract topic model distribution using LDA
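The date step can be sketched as follows. This is a minimal illustration, not the project's code: the function names `gtd_date` and `wits_date` are hypothetical, and the accepted WITS formats (MM/DD/YYYY, with a two-digit-year fallback) are assumed from the sample rows shown on the slides.

```python
from datetime import date, datetime


def gtd_date(iyear: int, imonth: int, iday: int) -> date:
    """Build a date from GTD's separate year/month/day columns."""
    return date(iyear, imonth, iday)


def wits_date(text: str) -> date:
    """Parse a WITS-style date string.

    Assumption: dates are month/day/year with either a four-digit
    or a two-digit year.
    """
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


# The example incident from the Objective/Approach slide.
gtd = gtd_date(2005, 6, 28)
wits = wits_date("06/27/2005")
print(gtd.isoformat(), wits.isoformat())
```

Once both sources use the same `date` type, the two records can be compared directly even though one source stores three integer columns and the other a single string.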

7 Phase 1: Latent Dirichlet Allocation
Generative probabilistic model
Assumes topics are probability distributions over words
Assumes documents are probability distributions over topics
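The generative story behind LDA can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the two topics, their word probabilities, and the Dirichlet parameter `alpha` are made up for the example, and the topics are held fixed rather than drawn from their own Dirichlet prior as in the full model.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative process.

    topics: list of dicts mapping word -> probability (one dict per topic).
    Returns the document's topic mixture and its list of words.
    """
    # Each document is a probability distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[k])
        probs = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics; the probabilities are illustrative only.
toy_topics = [
    {"grenade": 0.5, "threw": 0.3, "srinagar": 0.2},
    {"assam": 0.4, "ied": 0.3, "ulfa": 0.3},
]
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topics and each document's mixture `theta`, and that mixture is the topic distribution used as a feature in the later phases.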

8 Phase 1: LDA Example
* Example from “Probabilistic Topic Models” by Mark Steyvers.

9 Phase 1: Latent “Topics” (most probable words in topic)
killed, kashmir, attack, injured, militants, suspected, blast, kill, bomb
fired, upon, armed, killed, manipur, civilian, imphal, member, former
civilian, kashmir, jammu, night, residence, kidnapped, one, village, doda
police, one, killing, wounding, officers, two, officer, others, injuring
jammu, kashmir, baramula, security, one, armed, anantnag, hizbul, mujahedin
assam, explosive, front, improvised, device, liberation, united, ied, ulfa
widely, two, civilians, national, tripura, kidnapped, three, village, karbi
causing, injuries, damage, damaging, fire, station, set, detonated, train
maoist, party, communist, cpi, widely, pradesh, andhra, chhattisgarh, village
grenade, threw, civilians, wounding, srinagar, vehicle, two, kashmir, jammu


11 Phase 2: Compute Similarity
Dates: 05/23/2001 vs. 05/22/2001
Nominal strings such as country or city: “Jammu” vs. “Jammuu”
GeoLocation: Lat 32.8/Long 74.7 vs. Lat 32.27/Long 75.6
Topic distribution
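One way to score each of the four attribute types is sketched below. The slides do not say which measures were used, so every choice here is an assumption made for illustration: day gap for dates, `difflib` ratio for strings, haversine distance for coordinates, and cosine similarity for topic distributions. The function names are hypothetical.

```python
import math
from datetime import date
from difflib import SequenceMatcher


def date_gap_days(a: date, b: date) -> int:
    """Absolute difference between two dates, in days."""
    return abs((a - b).days)


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in kilometers (mean Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cosine_similarity(p, q) -> float:
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# The example pairs from the slide.
print(date_gap_days(date(2001, 5, 23), date(2001, 5, 22)))
print(string_similarity("Jammu", "Jammuu"))
print(haversine_km(32.8, 74.7, 32.27, 75.6))
```

Each pair of candidate incidents then yields one vector of scores, which is the input to the classification phase.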

12 Phase 3: Classify as Match/Non-match
Similarity Scores
Classifier Model Based on Ground Truth*
Match or Non-match
* The Center for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland provided the human-annotated ground truth data.
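The flow from similarity scores to a match decision can be sketched with a small classifier. The slides do not name the model that was used; logistic regression is swapped in here purely as an illustration, and the training rows below are invented stand-ins for the human-annotated pairs, not real data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def classify(x, w, b, threshold=0.5) -> str:
    """Label one similarity-score vector as Match or Non-match."""
    p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    return "Match" if p >= threshold else "Non-match"


# Hypothetical ground truth: [date, name, location, topic] similarities in [0, 1],
# labeled 1 (same incident) or 0 (different incidents).
rows = [
    [0.95, 0.90, 0.98, 0.85],
    [0.90, 1.00, 0.95, 0.70],
    [0.85, 0.80, 0.90, 0.90],
    [0.20, 0.30, 0.10, 0.25],
    [0.10, 0.00, 0.05, 0.40],
    [0.30, 0.20, 0.15, 0.10],
]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(rows, labels)
print(classify([0.92, 0.95, 0.90, 0.80], w, b))
print(classify([0.15, 0.10, 0.10, 0.20], w, b))
```

Learning the weights from labeled pairs, rather than hand-tuning a threshold per attribute, lets the ground truth decide how much each similarity score should count.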

13 Phase 3: Classifier Results
Class | Classified Non-match | Classified Match
Non-match | |
Match | 116246
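A results table of this kind is a confusion matrix: actual class by classified class. The sketch below shows how such a table and its summary metrics could be tallied. The label lists are hypothetical and are not the results reported on the slide; the function names are also illustrative.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def precision_recall(counts, positive="Match", negative="Non-match"):
    """Precision and recall for the positive class; 0.0 when undefined."""
    tp = counts[(positive, positive)]
    fp = counts[(negative, positive)]
    fn = counts[(positive, negative)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for five candidate pairs.
actual = ["Match", "Match", "Non-match", "Non-match", "Match"]
predicted = ["Match", "Non-match", "Non-match", "Match", "Match"]

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```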

14 My research possibilities
Clean up the ground truth data
Improve upon the HO-LDA algorithm
Consider how to compute different similarity scores
