Named Entity Disambiguation using Linked Data
Danica Damljanović, The University of Sheffield
Brunel University London, 05 March 2012
Named Entity Disambiguation in TrendMiner
[Platform overview diagram: newswire, market data, polls, … feed into multilingual text processing (EN, DE, IT, BG, HI), time-series machine learning models, cross-lingual summarisation, and knowledge-based search and browse in the TrendMiner platform, supporting financial decisions and political analysis for partners Hardik Fintrade Pvt. Ltd., SORA and Eurokleis srl.]
Named Entity Recognition is the first step, and it is important to get it right!
Example
Linked Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
Why DBpedia?
- Regularly updated (from Wikipedia)
- Good source for named entities
- A hierarchy of concepts: a capital is also a city, but not vice versa
- Relations between concepts: Paris locatedIn France; ParisHilton bornIn NewYorkCity
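As an illustration of this hierarchy, the ontology classes behind an entity can be queried from DBpedia's public SPARQL endpoint. A minimal sketch in Python using the SPARQLWrapper library (the query and endpoint choice are illustrative, not part of the system described here):

    # Query DBpedia for the ontology classes of dbr:Paris, showing the
    # concept hierarchy (Paris is a City, a Settlement and a Place).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT DISTINCT ?type WHERE {
            dbr:Paris a ?type .
            FILTER(STRSTARTS(STR(?type), STR(dbo:)))  # keep DBpedia ontology classes
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["type"]["value"])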
Task
Identify named entities in text and attach the correct DBpedia URI to each of them.
Named Entity Recognition: ANNIE
- Produces NE types such as Organization, Location and Person
- Resolves coreference: entities with the same meaning are linked, e.g. General Motors and GM
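ANNIE ships with the Java-based GATE framework, so it is not reproduced here. Purely as a stand-in illustration of what the NER step outputs, here is a sketch using spaCy (spaCy is not used in this work, and unlike ANNIE this sketch does not resolve the coreference between the two mentions):

    # Stand-in NER illustration: extract typed entity mentions from raw text.
    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
    doc = nlp("General Motors was founded in Flint. GM is based in Detroit.")
    for ent in doc.ents:
        print(ent.text, ent.label_)      # e.g. "General Motors" ORG, "Flint" GPE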
Entity Linking: the Large Knowledge Base Gazetteer (LKB)
- Matches text against URIs
- Matches only against the values of the rdfs:label and foaf:name properties
- For all instances of the classes dbpedia-ont:Person, dbpedia-ont:Organisation and dbpedia-ont:Place
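The lookup can be approximated with a DBpedia SPARQL query: find instances of the three classes whose label or name equals the surface string. A rough sketch, not the actual gazetteer implementation:

    # Candidate URIs for the surface string "Paris": instances of Person,
    # Organisation or Place whose rdfs:label or foaf:name is exactly "Paris".
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?uri WHERE {
            VALUES ?class { dbo:Person dbo:Organisation dbo:Place }
            ?uri a ?class .
            { ?uri rdfs:label "Paris"@en } UNION { ?uri foaf:name "Paris"@en }
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["uri"]["value"])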
So, why not just combine them?
- NE types generated by ANNIE miss the URI
- LKB does not use any context, so it produces spurious entities
- E.g. each letter B is annotated as a possible mention of dbpedia:B_%28Los_Angeles_Railway%29, which refers to a line called B operated by the Los Angeles Railway
How to filter out the noise?
1. Identify NEs (Location, Organisation and Person) using ANNIE
2. For each NE, add the URIs of matching instances from DBpedia
3. For each ambiguous NE, calculate disambiguation scores
4. Remove all matches except the highest-scoring one
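A sketch of this pipeline in Python; lookup_candidates and disambiguation_score are hypothetical placeholders for the LKB lookup and the score defined on the next slide:

    # Hypothetical sketch: for every ambiguous NE, keep only the candidate
    # DBpedia URI with the highest disambiguation score.
    def disambiguate(mentions, context):
        linked = {}
        for mention in mentions:                       # NEs found by ANNIE
            candidates = lookup_candidates(mention)    # matching URIs (LKB step)
            if not candidates:
                continue                               # no DBpedia match at all
            linked[mention] = max(
                candidates,
                key=lambda uri: disambiguation_score(mention, uri, context))
        return linked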
Disambiguation score
- Uses context
- A weighted sum of three similarity metrics: string similarity, structural similarity and contextual similarity
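A minimal sketch of the combination, assuming each component score is normalised to [0, 1]; the weights and the helper names (string_similarity, structural_similarity, contextual_similarity, label_of) are illustrative assumptions, not the values used in the actual system:

    # Weighted sum of the three similarity metrics (weights are assumptions).
    def disambiguation_score(mention, uri, context,
                             w_string=0.4, w_struct=0.3, w_context=0.3):
        return (w_string  * string_similarity(mention, label_of(uri)) +
                w_struct  * structural_similarity(uri, context) +
                w_context * contextual_similarity(mention, uri, context))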
String similarity
The edit distance between the text string and the labels of the matching URIs:

                                      Levenshtein   Jaccard   Monge-Elkan
    Paris vs Paris Hilton                               0.5           1.0
    Paris vs Paris, Ontario                             0.0           1.0
    Paris Hilton vs Paris, Ontario                      0.0
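The three metrics can be sketched in a few lines of Python (a toy implementation; the actual system may rely on a string-metric library such as SimMetrics):

    # Toy implementations of the three string similarity metrics.
    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def jaccard(a, b):
        # overlap of token sets
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb)

    def monge_elkan(a, b):
        # average, over tokens of a, of the best secondary similarity in b;
        # the secondary measure here is a normalised Levenshtein similarity
        sim = lambda x, y: 1 - levenshtein(x, y) / max(len(x), len(y))
        return sum(max(sim(x, y) for y in b.split())
                   for x in a.split()) / len(a.split())

    print(jaccard("Paris", "Paris Hilton"))      # 0.5, as in the table above
    print(monge_elkan("Paris", "Paris Hilton"))  # 1.0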
Structural similarity
Is there a relation between the ambiguous NE and any other NE from the same sentence or document?
- Paris ... France >> true (Paris capitalOf France)
- Paris ... New York >> true (ParisHilton bornIn NewYorkCity)
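This relatedness test can be sketched as a SPARQL ASK query against DBpedia (illustrative only; the real check may also consider longer property paths):

    # Is there any direct property between two candidate entities?
    from SPARQLWrapper import SPARQLWrapper, JSON

    def directly_related(uri_a, uri_b):
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery(f"""
            ASK {{ {{ <{uri_a}> ?p <{uri_b}> }} UNION {{ <{uri_b}> ?p <{uri_a}> }} }}
        """)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()["boolean"]

    print(directly_related("http://dbpedia.org/resource/Paris",
                           "http://dbpedia.org/resource/France"))   # True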
Contextual similarity
The probability that two words appear with a similar set of other words (Random Indexing). Nearest terms in the Random Indexing space:

    Paris France:   :paris :métro :paul-martin :lewden :pimpfen :théas :werfft :birmoverse :cszhech :pierre
    Paris Ontario:  :paris :ontario :merrickville-wolford :naiscoutaing :neguaguon :magnetewan :wabauskang :tp :s-e :henvey
    Paris Hilton:   :hilton :paris :poverty-related :jaumont :jaune-montagne :malancourt-la-montagne :mons–january :métro :tank-tread :“plane’s
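A toy sketch of Random Indexing, assuming a plain tokenised corpus: each term receives a sparse random index vector, a term's context vector accumulates the index vectors of its neighbours, and terms are compared by cosine similarity. Dimensionality, seed count and window size are illustrative hyper-parameters:

    # Toy Random Indexing: build context vectors from a token list, then
    # compare terms by cosine similarity of their context vectors.
    import numpy as np

    DIM, SEEDS, WINDOW = 300, 10, 2   # illustrative hyper-parameters

    def index_vector(rng):
        v = np.zeros(DIM)
        pos = rng.choice(DIM, SEEDS, replace=False)
        v[pos] = rng.choice([-1, 1], SEEDS)            # sparse +/-1 seeds
        return v

    def context_vectors(tokens):
        rng = np.random.default_rng(42)
        index = {t: index_vector(rng) for t in set(tokens)}
        ctx = {t: np.zeros(DIM) for t in index}
        for i, t in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[t] += index[tokens[j]]         # accumulate neighbours
        return ctx

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    tokens = "paris is the capital and largest city of france".split()
    ctx = context_vectors(tokens)
    print(cosine(ctx["paris"], ctx["france"]))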
Evaluation
Precision, recall and f-measure compared for three configurations:
- LKB
- LKB + ANNIE
- LKB + ANNIE + Disambiguation
Corpus: manually annotated Wikipedia user profiles
Conclusion
Using Linked Data as an additional knowledge source for resolving context eliminated a large number of incorrect annotations.
Thank You! Questions?
More about the project: http://…project.eu
Contact: