Published by Macie Hilbert. Modified over 9 years ago.
1
David Inouye
Georgia Institute of Technology
2011 DIMACS REU Intern at Rutgers University
William M. Pottenger, Ph.D., Mentor
* The content of this presentation has been adapted from a presentation given by Nir Grinberg.
06/07/2011
2
Introduction to Entity Resolution
Entity resolution is the problem of deciding whether two sets of data elements refer to the same real-world entity.
[Figure: Elements from Source 1; Elements from Source 2; "? ? ?"]
4
Objective/Approach
Incidents in GTD*: Month: 6, Day: 28, Year: 2005, City: Dardsun, Kupwara, Type: Arson
Incidents in WITS*: Date: 06/27/2005, City: Kupwara, Type: Fire attack
Standardize and Encode -> Calculate Similarity Scores -> Classify Using Ground Truth Data
* WITS - https://wits.nctc.gov/; GTD - http://www.start.umd.edu/gtd/
5
Phase 1: Standardize and Encode

WITS
Incident_ID | Date      | City     | State_Prov        | Country
40426       | 12/3/06   | Udhampur | Jammu and Kashmir | India
15649       | 6/27/2005 | Kupwara  | Jammu and Kashmir | India

GTD
Eventid      | Iyear | Imonth | Iday | City            | Provstate               | country
200404140003 | 2004  | 4      | 14   | Patna           | Bihar                   | India
200506280004 | 2005  | 6      | 28   | Dardsun Kupwara | Jammu & Kashmir (State) | India
6
Phase 1: Standardize and Encode
- Standardize dates
- Map WITS weapon types to GTD weapon types
- Geocode locations to latitude and longitude
- Extract topic model distributions using LDA
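The date standardization step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `gtd_date` and `wits_date` are hypothetical, and the only formats handled are the ones visible in the sample rows (separate year/month/day fields for GTD, and M/D/YY or M/D/YYYY strings for WITS).

```python
from datetime import date, datetime


def gtd_date(iyear: int, imonth: int, iday: int) -> date:
    """Build a date from GTD-style separate year, month, and day fields."""
    return date(iyear, imonth, iday)


def wits_date(text: str) -> date:
    """Parse a WITS-style date string written as M/D/YYYY or M/D/YY."""
    # Try the four-digit year first so "2005" is never read as a two-digit year.
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


# Sample values taken from the WITS and GTD rows shown in the slides.
gtd = gtd_date(2005, 6, 28)
wits = wits_date("6/27/2005")
day_gap = abs((gtd - wits).days)  # both sources now share one date type
```

Once both sources use the same date type, the gap in days between two incidents can be computed directly, which is what the later similarity phase needs.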
7
Phase 1: Latent Dirichlet Allocation
- Generative probabilistic model
- Assumes topics are probability distributions over words
- Assumes documents are probability distributions over topics
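The two assumptions above correspond to the standard LDA generative process, written out below. The symbols are the conventional ones from the topic-modeling literature and do not appear on the slides: $\phi_k$ is topic $k$'s distribution over words, $\theta_d$ is document $d$'s distribution over topics, and $\alpha$, $\beta$ are Dirichlet hyperparameters.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta)
         && \text{for each topic } k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
         && \text{for each document } d \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d)
         && \text{topic assignment for word position } n \text{ in } d \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
         && \text{observed word at position } n \text{ in } d
\end{align*}
```

Inference reverses this process: given only the observed words $w_{d,n}$, it estimates $\theta_d$ for each incident description, and that estimated topic distribution is what gets compared across sources.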
8
Phase 1: LDA Example
[Figure: LDA example]
* Example from “Probabilistic Topic Models” by Mark Steyvers and Tom Griffiths. http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
9
Phase 1: Latent “Topics” (most probable words in topic)
- killed, kashmir, attack, injured, militants, suspected, blast, kill, bomb
- fired, upon, armed, killed, manipur, civilian, imphal, member, former
- civilian, kashmir, jammu, night, residence, kidnapped, one, village, doda
- police, one, killing, wounding, officers, two, officer, others, injuring
- jammu, kashmir, baramula, security, one, armed, anantnag, hizbul, mujahedin
- assam, explosive, front, improvised, device, liberation, united, ied, ulfa
- widely, two, civilians, national, tripura, kidnapped, three, village, karbi
- causing, injuries, damage, damaging, fire, station, set, detonated, train
- maoist, party, communist, cpi, widely, pradesh, andhra, chhattisgarh, village
- grenade, threw, civilians, wounding, srinagar, vehicle, two, kashmir, jammu
11
Phase 2: Compute Similarity
- Dates: 05/23/2001 vs. 05/22/2001
- Nominal strings such as country or city: “Jammu” vs. “Jammuu”
- GeoLocation: Lat 32.8/Long 74.7 vs. Lat 32.27/Long 75.6
- Topic distribution
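One plausible way to score each of the four field types is sketched below. The slides do not say which measures were used, so every choice here is an assumption made for illustration: a linear day-gap score with a hypothetical 30-day cutoff, normalized Levenshtein distance for strings, haversine distance for coordinates, and one minus the Hellinger distance for topic distributions.

```python
import math
from datetime import date


def date_similarity(a: date, b: date, max_days: int = 30) -> float:
    """1.0 for identical dates, falling linearly to 0.0 at max_days apart."""
    return max(0.0, 1.0 - abs((a - b).days) / max_days)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete from s
                            curr[j - 1] + 1,             # insert into s
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]


def string_similarity(s: str, t: str) -> float:
    """Case-insensitive edit distance scaled to the range [0, 1]."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s.lower(), t.lower()) / longest


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def topic_similarity(p: list, q: list) -> float:
    """One minus the Hellinger distance between two topic distributions."""
    bc = sum(math.sqrt(x * y) for x, y in zip(p, q))  # Bhattacharyya coefficient
    return 1.0 - math.sqrt(max(0.0, 1.0 - bc))


# The example pairs from the slide.
d_sim = date_similarity(date(2001, 5, 23), date(2001, 5, 22))
s_sim = string_similarity("Jammu", "Jammuu")
geo_km = haversine_km(32.8, 74.7, 32.27, 75.6)
```

The geographic function returns a distance rather than a similarity; it would need to be rescaled before being combined with the other scores, and the right scale depends on how precise the geocoding is.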
12
Phase 3: Classify as Match/Non-match
Similarity Scores -> Classifier Model Based on Ground Truth* -> Match or Non-match
* The Center for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland provided the human-annotated ground truth data.
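The flow from similarity scores to a match decision can be sketched with a small classifier. The slides do not name the classifier that was used, so logistic regression is swapped in here purely as a stand-in. The training vectors and labels are made up for illustration and are not the START ground truth data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def classify(weights, bias, x, threshold=0.5) -> str:
    """Label one pair of incidents from its similarity-score vector."""
    score = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return "Match" if score >= threshold else "Non-match"


# Hypothetical similarity vectors: [date, city, geo, topic]; label 1 = match.
train_x = [[0.97, 0.83, 0.90, 0.80], [0.95, 1.00, 0.85, 0.70],
           [0.10, 0.20, 0.05, 0.30], [0.40, 0.10, 0.20, 0.15]]
train_y = [1, 1, 0, 0]
weights, bias = train_logistic(train_x, train_y)
```

In practice non-matching pairs vastly outnumber matching ones, so the decision threshold and the training sample would both need more care than this toy setup shows.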
13
Phase 3: Classifier Results
Confusion matrix (rows: class; columns: classified as Non-match, Match)
Class Non-match: 9875511
Class Match: 116246
14
My research possibilities
- Clean up the ground truth data
- Improve upon the HO-LDA algorithm
- Consider how to compute different similarity scores