Max Planck Institute for Informatics, Germany

Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties
Simon Razniewski, Max Planck Institute for Informatics, Germany
Joint work with Vevake Balaraman (FBK Trento, Italy) and Werner Nutt (FU Bozen-Bolzano, Italy)

Knowledge Bases
Structured collections of facts about the general world
Examples: Wikidata, DBpedia, Google Knowledge Graph, Microsoft Satori, …

Date of birth: 5 February 1985
Native language: Portuguese
Position played: Forward
Nickname: El Comandante
Mass: 80 kg
Religion: Roman Catholic
Official website

Editors: Information overload
More than 3000 properties can be asserted in Wikidata: doctoral advisor, medical condition, monastic order, handedness, names of pets, …
Which ones actually matter?

The property ranking problem
Given: An entity and a set of attributes/properties
Question: Which properties are actually interesting for that entity?

Property Ranking: Applications
Knowledge base curation: What the heck, we are missing the team of Ronaldo?
Natural language generation: Which properties should be mentioned (first)?
Comparing informativeness of entries: How good is the data about Ronaldo compared with Maradona?

Related work
Considerable work on entity ranking (e.g. PageRank) and fact ranking (e.g. trivia generation)
Property suggestion:
Association rules [Abedjan and Naumann, DBS 2013]: ad-hoc combination of rules, no human evaluation
Generic machine learning [Atzori and Dessi, FGCS 2016]: class level only, limited evaluation
Frequency-based [Ahmeti, Razniewski, Polleres, ESWC 2017]: very simple heuristic

Contribution 1: Dataset
Properties: the 100 most frequent non-ID properties
Entities: recently edited ones
350 triples, each a random entity with two properties to be judged, e.g. (Ronaldo, doctoral advisor, medical condition)
Example judgment task:
Cristiano Ronaldo dos Santos Aveiro (born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Spanish club Real Madrid and the Portugal national team. Often considered the best player in the world, Ronaldo is the first player in history to win four European Golden Shoes.
Which of the following two properties would be more interesting to know about for Cristiano Ronaldo?
Doctoral advisor
Medical condition

Sample rows

Agreement distribution
Remainder: Focus on triples with >=0.8 agreement

Learning to Rank (L2R): Theory
3 core paradigms: pointwise, pairwise, and listwise L2R
Pointwise: Learn a score function, e.g. score(Ronaldo, positionPlayed) = 0.73
Pairwise: Learn a pairwise preference function, e.g. preference(Ronaldo, positionPlayed, lastName) = lastName
(Listwise: Directly learn lists from lists)
Pointwise: Unstable, because the score depends on framing
Supervised pairwise: Requires too much training data (100 properties → about 5,000 pairs per entity, we have only 1/350)
→ Choice left: Unsupervised approximation of pairwise ranking
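The pair count on this slide can be checked directly, and a pairwise preference function can be turned into a ranking by counting wins. The following is a minimal sketch, not the method from the talk: `rank_by_wins`, `toy_prefer`, and the interest scores are hypothetical names and made-up values used only for illustration.

```python
from itertools import combinations
from math import comb

def rank_by_wins(properties, prefer):
    """Order properties by how many pairwise comparisons each one wins."""
    wins = {p: 0 for p in properties}
    for a, b in combinations(properties, 2):
        wins[prefer(a, b)] += 1
    # Sort by descending win count; ties keep input order (sorted is stable).
    return sorted(properties, key=lambda p: -wins[p])

# Number of unordered property pairs per entity for 100 properties.
pairs_per_entity = comb(100, 2)  # 4950, i.e. roughly 5,000

# Hypothetical toy preference: made-up interest scores, for illustration only.
toy_interest = {"position played": 3, "date of birth": 2, "doctoral advisor": 1}

def toy_prefer(a, b):
    """Return whichever of the two properties has the higher toy score."""
    return a if toy_interest[a] >= toy_interest[b] else b

ranking = rank_by_wins(list(toy_interest), toy_prefer)
```

Any pairwise method discussed later (frequency, presence prediction, similarity) could be plugged in as `prefer` in such a scheme.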

Baselines (1/2)
Human frequency: Winner is the property more frequently used for humans
Occupation frequency (= Ahmeti et al. 2017): Winner is the property more frequently used in the entity's profession
Google count: Winner is the string returning more results in Google, e.g. “Ronaldo doctoral advisor” vs “Ronaldo medical condition”
Association rules (= Abedjan and Naumann 2013): Source code of the Wikidata implementation by Abedjan and Naumann is available
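The frequency baselines reduce to counting how many entities assert each property and letting the more frequent property win. A minimal sketch under stated assumptions: the knowledge base is a plain dictionary, the entity names `E1` to `E3` and their properties are invented, and restricting `toy_kb` to humans or to one occupation would give the two variants named above.

```python
from collections import Counter

def property_frequencies(entities):
    """Count, for each property, how many entities assert it."""
    counts = Counter()
    for props in entities.values():
        counts.update(set(props))  # count each property once per entity
    return counts

def frequency_winner(prop_a, prop_b, counts):
    """Return the property used by more entities; ties go to prop_a."""
    return prop_a if counts[prop_a] >= counts[prop_b] else prop_b

# Hypothetical toy knowledge base: entity -> asserted properties.
toy_kb = {
    "E1": ["date of birth", "position played"],
    "E2": ["date of birth", "doctoral advisor"],
    "E3": ["date of birth", "position played"],
}

counts = property_frequencies(toy_kb)
winner = frequency_winner("doctoral advisor", "position played", counts)
```

Note that this baseline ignores the entity itself within the chosen group, which is one reason it stays close to 60% in the results.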

Baselines (2/2)
Method                 Performance
Random                 50%
Human frequency        60.6%
Occupation frequency   58.6%
Google count           58%
Property suggester     61.3%
Annotator agreement    87.5%

How can we get better?
Use property presence
Explore semantic similarity between entities and properties

Property presence: Idea
Hypothesis: Property presence indicates interestingness
If true: Can use a model that predicts presence to predict interestingness (“transfer learning”)
But can we predict property presence?

Property presence: Training
One regression classifier per property pair
Input: Bag of words from the entity's Wikipedia article
Training: 5000 entities that have Property 1 and not Property 2, plus 5000 entities that have Property 2 and not Property 1
Precision on predicting presence: 94.8%
Example: weights for position played vs. religious order
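A per-pair presence classifier of this kind can be sketched as follows. This is an illustration only: it assumes that "regression classifier" means logistic regression, trains with plain stochastic gradient steps in pure Python, and uses four invented one-line texts in place of Wikipedia articles. All function names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def train_logistic(examples, epochs=200, lr=0.5):
    """examples: list of (text, label); label 1 = has Property 1, 0 = has Property 2."""
    weights = Counter()  # word -> learned weight
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = bag_of_words(text)
            z = bias + sum(weights[w] * c for w, c in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            err = label - p  # gradient of the log-likelihood w.r.t. z
            bias += lr * err
            for w, c in feats.items():
                weights[w] += lr * err * c
    return weights, bias

def predict_proba(text, weights, bias):
    """Estimated probability that the entity has Property 1 rather than Property 2."""
    z = bias + sum(weights[w] * c for w, c in bag_of_words(text).items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented toy texts: 1 = "position played", 0 = "religious order".
train = [
    ("footballer who plays as a forward for the club", 1),
    ("striker scored goals for the national team", 1),
    ("monk who joined the abbey and the order", 0),
    ("nun of the convent and the monastery", 0),
]
weights, bias = train_logistic(train)
p_sport = predict_proba("a forward who scored goals", weights, bias)
p_monk = predict_proba("a monk of the abbey", weights, bias)
```

The learned word weights play the role of the "weights for position played vs. religious order" example: words typical of footballers get positive weights, words typical of monastic life get negative ones.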

Property presence: Transfer
Precision in predicting interestingness: 72%
About 10 percentage points better than the baselines
→ Expensive training (O(n²) classifiers) pays off
But: Still data-driven
Misses similarity, e.g. soccer player and drafted by vs. military conflict

How can we get better?
Predict property presence
Explore semantic similarity between entities and properties

Semantic similarity: Idea
Entities have Wikipedia articles, e.g. "Cristiano Ronaldo is a professional footballer who…"
Properties have descriptions, e.g. Position played: "Position that someone played on for a team"
Use semantic similarity between entity articles and property descriptions as a heuristic for interestingness

Semantic similarity: Implementation
Topic modelling = Represent texts as distributions of topics
Common techniques: LSI and LDA (Latent Semantic Indexing/Latent Dirichlet Allocation)
→ Semantic similarity by cosine of topic vectors

Semantic similarity: LDA Examples
Ronaldo = 52% ⋅ T6 + 12% ⋅ T18 + 7% ⋅ T26 + 6% ⋅ T41 + …
Member of sports team = 96% ⋅ T6 + …
Goals scored = 92% ⋅ T6 + …
…
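The cosine between topic vectors can be illustrated with the weights listed on this slide. This is only a rough sketch: the elided topic weights ("…") are unknown and are simply left out, so the resulting similarities are approximate, and `cosine` is a hypothetical helper rather than code from the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse topic vectors (dict: topic -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no topics at all: treat as dissimilar
    return dot / (norm_u * norm_v)

# Only the weights shown on the slide; the remaining topic mass is omitted.
ronaldo = {"T6": 0.52, "T18": 0.12, "T26": 0.07, "T41": 0.06}
member_of_sports_team = {"T6": 0.96}
goals_scored = {"T6": 0.92}

sim_team = cosine(ronaldo, member_of_sports_team)
sim_goals = cosine(ronaldo, goals_scored)
```

Because both truncated property vectors contain only topic T6, they have the same cosine with Ronaldo; with the full distributions the two values would differ slightly.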

Semantic similarity: Performance
Precision on predicting interestingness: LDA: 60%, LSI: 65.3%
Pro: Great at discovering relevance/applicability
Cons: Struggles with short property descriptions; similarity is only a mediocre proxy for interestingness

Ensemble classifiers
Condorcet’s jury theorem (1785): The more people involved in a voting decision, the better, if their judgment is better than random and if they are sufficiently statistically independent
Machine learning: Combination of many weak classifiers can be beneficial
Majority voting among Google count, LSI, LDA, Occupation frequency, and regression: 74% precision
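Majority voting and the effect described by the jury theorem can be sketched in a few lines. The second function assumes fully independent voters with identical accuracy, an idealization that the five methods above do not satisfy exactly; the example votes are invented.

```python
from collections import Counter
from math import comb

def majority_vote(votes):
    """Return the option chosen by most voters (ties: first option seen wins)."""
    return Counter(votes).most_common(1)[0][0]

def majority_accuracy(n_voters, p_correct):
    """Probability that a majority of n independent voters is correct.

    Each voter is correct with probability p_correct; n_voters should be odd.
    """
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )

# Invented example: five methods each pick the more interesting property.
votes = [
    "position played",
    "position played",
    "religious order",
    "position played",
    "religious order",
]
winner = majority_vote(votes)

# Five idealized, independent voters that are each right 60% of the time.
acc_5 = majority_accuracy(5, 0.6)
```

Under these idealized assumptions five voters at 60% reach about 68% as a group, which shows why combining weak, roughly independent methods can beat each of them individually.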

Conclusion
Semantic similarity is great for detecting applicability
Predicting presence is expensive, but worth it
Best method (ensemble): 74% precision
Much better than baselines (~60%)
Still notably worse than humans (87.5%)
Possible extensions:
Include an explicit notion of importance
Acquire larger textual descriptions of properties
Explore incorporation of formal semantics/constraints
Dataset:

Appendix: Instability of pointwise scores
Setup 1, important properties: country of origin, participant of, award received, date of birth, member of sports team, position played on team / speciality, country of citizenship
Setup 2, tail properties: academic degree, image of grave, educated at, brother, hair color, military rank, religion

Appendix: Correlation between methods
