1 Finding frequent and interesting triples in text
Janez Brank, Dunja Mladenić, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia

2 Motivation
Help with populating a knowledge base / ontology (e.g. something like Cyc) with common-sense “facts” that would help with reasoning or querying
– We are interested in ⟨concept 1, relation, concept 2⟩ triples
– E.g. ⟨person, inhabit, country⟩ tells us that a country is something that can be inhabited by a person, which is potentially useful
We would like to automatically extract such triples from a corpus of text
– Such triples tend to involve slightly abstract concepts and are not mentioned directly in the text, but their specializations are
– We will use WordNet to generalize concepts

3 Overview of the approach
Corpus of text
↓ parser + some heuristics
List of ⟨subject, predicate, object⟩ triples
↓ WordNet
List of concept triples
↓ generalization, minimum support threshold
List of frequent triples
↓ measures of interest
List of frequent, interesting triples

4 Associating input triples with WordNet concepts
Our input was a list of ⟨subject, predicate, object⟩ triples
– Each component is a phrase in natural language, e.g. ⟨European Union finance ministers, approved, convergence plans⟩
– But we would like each component to be a WordNet concept, so that we can use WordNet for generalization
We use a simple heuristic approach:
– Look for the longest subsequence of words that also happens to be the name of a WordNet concept (thus “finance minister”, not “minister”)
– Break ties by selecting the rightmost such sequence (thus “finance minister”, not “European Union”)
– Normalize words when matching (“ministers” → “minister”)
– Use only the nouns in WordNet when processing the subject and object, and only the verbs when processing the predicate
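A minimal sketch of this matching heuristic, written against NLTK's WordNet interface; the function name, the contiguous-sequence scan, and the use of wn.morphy for normalization are illustrative assumptions, not the authors' implementation:

from nltk.corpus import wordnet as wn

def phrase_to_concept(words, pos=wn.NOUN):
    """Return a WordNet synset for the longest word sequence in `words`
    that names a WordNet concept, preferring the rightmost match on ties."""
    for length in range(len(words), 0, -1):              # longer sequences first
        for start in range(len(words) - length, -1, -1): # rightmost start first
            candidate = "_".join(words[start:start + length])
            # wn.morphy gives a rough normalization, e.g. "ministers" -> "minister"
            lemma = wn.morphy(candidate, pos) or candidate
            synsets = wn.synsets(lemma, pos=pos)
            if synsets:
                return synsets[0]
    return None

# Subject phrase from the slide's example; should resolve to "finance minister"
print(phrase_to_concept("European Union finance ministers".lower().split()))

For the predicate, the same function would be called with pos=wn.VERB, matching the restriction to verbs for that position.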

5 Identifying frequent triples
Now we have a list of concept triples, each of which corresponds roughly to one clause in the input textual corpus
Let u → v denote that v is a hypernym (direct or indirect) of u in WordNet (including u = v)
support(s, v, o) := the number of concept triples (s', v', o') such that s' → s, v' → v, o' → o
– Thus, a triple that supports ⟨finance minister, approve, plan⟩ also supports ⟨executive, approve, idea⟩
We want to identify all ⟨s, v, o⟩ whose support exceeds a certain threshold
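As an illustration of this support definition (not the authors' code), the naive computation counts, for every input concept triple, all of its generalizations along WordNet hypernym links; the helper names are assumptions:

from collections import Counter
from itertools import product
from nltk.corpus import wordnet as wn

def generalizations(synset):
    """All direct and indirect hypernyms of `synset`, including the synset itself."""
    return {synset} | set(synset.closure(lambda s: s.hypernyms()))

def count_support(concept_triples):
    """Map every generalized triple (s, v, o) to its support count."""
    support = Counter()
    for s, v, o in concept_triples:
        for gen_triple in product(generalizations(s),
                                  generalizations(v),
                                  generalizations(o)):
            support[gen_triple] += 1
    return support

On a realistic corpus this exhaustive expansion is far too large, which is exactly what motivates the adapted Apriori-style search on the next slide.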

6 Identifying frequent triples
We use an algorithm inspired by Apriori
However, we have to adapt it to prevent the generation of an intractably large number of candidate triples (most of which would turn out to be infrequent)
We use the depth of concepts in the WordNet hierarchy to order the search space
Process triples in increasing order of the sum of the depths of their concepts
– Each depth-sum requires one pass through the data
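A rough sketch of this depth-ordered, Apriori-style search; the helper functions passed in as arguments stand in for candidate generation and the generalization test and are assumptions, not the authors' implementation:

def mine_frequent_triples(concept_triples, max_depth_sum, min_support,
                          candidates_at_depth, is_generalization_of):
    """Return {candidate triple: support} for all frequent generalized triples,
    processed in increasing order of the sum of the concepts' WordNet depths."""
    frequent = {}
    for depth_sum in range(max_depth_sum + 1):
        # Apriori-style pruning: candidates at this depth-sum are specializations
        # of triples already found frequent at smaller depth-sums.
        candidates = candidates_at_depth(depth_sum, frequent)
        counts = dict.fromkeys(candidates, 0)
        for triple in concept_triples:                  # one pass over the data
            for cand in candidates:
                if is_generalization_of(cand, triple):
                    counts[cand] += 1
        frequent.update({c: n for c, n in counts.items() if n >= min_support})
    return frequent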

7 Identifying interesting triples
Not all frequent triples are interesting
– Generalizing one or more components of a triple leads to a higher (or at least equal) support
– Thus the most general triples are also the most frequent, but they are not interesting, e.g. ⟨entity, act, entity⟩
We are investigating heuristics to identify which triples are likely to be interesting:
– Let s be a concept and s' its hypernym.
– Every input triple that supports s in its subject also supports s', but the converse is usually not true.
– We can think of the ratio support(s) / support(s') as a “conditional probability” P(s|s').
– So we might naively expect that P(s|s') · support(s', v, o) input triples will support the triple ⟨s, v, o⟩.
– But the actual support(s, v, o) can be quite different. If it is significantly higher, we conclude that s fits well together with v and o.
– Thus, interestingness_s(s, v, o) = support(s, v, o) / (P(s|s') · support(s', v, o)).
– Analogous measures can be defined for v and o as well.
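In code, the subject-based score is a simple ratio; a minimal sketch, assuming `triple_support` and `subject_support` are lookup tables produced by the frequent-triple step and `hypernym_of` returns the hypernym s' used for the comparison (all names are illustrative assumptions):

def interestingness_subject(s, v, o, triple_support, subject_support, hypernym_of):
    """interestingness_s(s, v, o) = support(s, v, o) / (P(s|s') * support(s', v, o))"""
    s_prime = hypernym_of(s)                           # the hypernym s' of s
    p = subject_support[s] / subject_support[s_prime]  # "conditional probability" P(s|s')
    return triple_support[(s, v, o)] / (p * triple_support[(s_prime, v, o)])

The analogous scores for v and o are obtained by generalizing the predicate or the object instead of the subject.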

8 Identifying interesting triples
But this measure of interestingness turns out to be too sensitive to outliers and quirks in the WordNet hierarchy
Define the sv-neighbourhood of a triple ⟨s, v, o⟩ as the set of all (frequent) triples with the same s and v.
– The so- and vo-neighbourhoods can be defined analogously.
Possible criteria to select interesting triples now include:
– A triple is interesting if it is the most interesting one in two (or even all three) of its neighbourhoods (sv-, so- and vo-).
– We might also require that the neighbourhoods be large enough.
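A small sketch of this neighbourhood-based selection rule (function and parameter names are assumptions; `score` stands for one of the interestingness measures above):

from collections import defaultdict

def select_interesting(frequent_triples, score, min_wins=2, min_size=1):
    """Keep triples that are the highest-scoring member of at least `min_wins`
    of their sv-, so- and vo-neighbourhoods of size >= `min_size`."""
    neighbourhoods = defaultdict(list)
    for s, v, o in frequent_triples:
        neighbourhoods[("sv", s, v)].append((s, v, o))
        neighbourhoods[("so", s, o)].append((s, v, o))
        neighbourhoods[("vo", v, o)].append((s, v, o))
    wins = defaultdict(int)
    for members in neighbourhoods.values():
        if len(members) < min_size:
            continue
        wins[max(members, key=score)] += 1
    return [triple for triple, n in wins.items() if n >= min_wins]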

9 Experiments: Frequent triples
Input: 15.9 million ⟨subject, predicate, object⟩ triples extracted from the Reuters (RCV1) corpus
For 11.8 million of them, we were able to associate the components with WordNet concepts; these are the basis of further processing.
Frequent triple discovery:
– Found 40 million frequent triples (at various levels of generalization) in about 60 hours of CPU time
– Required 35 passes through the data (one for each depth-sum)
– At no pass did the number of candidates generated exceed the number of actually frequent triples by more than 60%

10 Experiments: Interesting triples
We manually evaluated the interestingness of all the frequent triples that are specializations of ⟨person, inhabit, location⟩ (there were 1321 of them)
– On a scale of 1 to 5, we consider scores of 4 and 5 as interesting
– If, instead of looking at all these triples, we select a smaller group of them on the basis of our interestingness measures, does the percentage of triples scored 4 or 5 increase?

11 Conclusions and future work
Frequent triples
– Our frequent-triple algorithm successfully handles large amounts of data
– Its memory footprint only minimally exceeds the amount needed to store the actual frequent triples themselves
Interesting triples
– Our measure of interestingness has some potential, but it remains to be seen what the right way to use it is
– Evaluation involving a larger set of triples is planned
Ideas for future work: covering approaches
– Suppose we fix s and v, and look at where the corresponding o's (i.e. those for which ⟨s, v, o⟩ is frequent) fall in the WordNet hypernym tree
– We want to identify nodes whose subtrees cover many of these concepts but not too many other concepts (combined with an MDL criterion)
– Alternative: think of the input concept triples as positive examples, and generate random triples of concepts as negative examples; use these as the basis for a coverage problem similar to those used in learning association rules.

