Finding frequent and interesting triples in text
Janez Brank, Dunja Mladenić, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia


Motivation
Help with populating a knowledge base / ontology (e.g. something like Cyc) with common-sense “facts” that would help with reasoning or querying
– We'll be interested in ⟨concept 1, relation, concept 2⟩ triples
– E.g. ⟨person, inhabit, country⟩ tells us that a country is something that can be inhabited by a person, which is potentially useful
We'd like to automatically extract such triples from a corpus of text
– They are likely to contain slightly abstract concepts and so aren't mentioned directly in the text, but their specializations are
– We will use WordNet to generalize concepts

Overview of the approach
Corpus of text
→ (parser + some heuristics) → list of ⟨subject, predicate, object⟩ triples
→ (WordNet) → list of concept triples
→ (generalization, minimum support threshold) → list of frequent triples
→ (measures of interest) → list of frequent, interesting triples

Associating input triples with WordNet concepts
Our input was a list of ⟨subject, predicate, object⟩ triples
– Each component is a phrase in natural language, e.g. ⟨European Union finance ministers, approved, convergence plans⟩
– But we'd like each component to be a WordNet concept so that we'll be able to use WordNet for generalization
We use a simple heuristic approach (see the sketch below):
– Look for the longest subsequence of words that also happens to be the name of a WordNet concept (thus “finance minister”, not “minister”)
– Break ties by selecting the rightmost such sequence (thus “finance minister”, not “European Union”)
– Be prepared to normalize words when matching (“ministers” → “minister”)
– Use only the nouns in WordNet when processing the subject and object, and only the verbs when processing the predicate
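A minimal sketch of this longest-match heuristic, using NLTK's WordNet interface. The function name, the normalization via wn.morphy and the underscore-joined lookup are our assumptions; the authors' actual implementation may differ.

```python
# Hypothetical sketch of the phrase-to-concept heuristic; not the authors' code.
from nltk.corpus import wordnet as wn

def phrase_to_concept(words, pos=wn.NOUN):
    """Return the WordNet lemma matching the longest (rightmost on ties)
    contiguous subsequence of the given words, or None if nothing matches."""
    n = len(words)
    for length in range(n, 0, -1):                 # longest subsequences first
        for start in range(n - length, -1, -1):    # rightmost start wins ties
            chunk = words[start:start + length]
            # normalize each word, e.g. "ministers" -> "minister"
            lemma = "_".join(wn.morphy(w.lower(), pos) or w.lower() for w in chunk)
            if wn.synsets(lemma, pos=pos):         # the phrase names a WordNet concept
                return lemma
    return None

# e.g. phrase_to_concept("European Union finance ministers".split())
# should prefer "finance_minister" over "minister" or "european_union",
# provided the multi-word entry exists in the local WordNet installation.
```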

Identifying frequent triples
Now we have a list of concept triples, each of which corresponds roughly to one clause in the input textual corpus
Let u ⇒ v denote that v is a hypernym (direct or indirect) of u in WordNet (including u = v)
support(s, v, o) := the number of concept triples (s', v', o') such that s' ⇒ s, v' ⇒ v, o' ⇒ o
– Thus, a triple that supports ⟨finance minister, approve, plan⟩ also supports ⟨executive, approve, idea⟩
We want to identify all ⟨s, v, o⟩ whose support exceeds a certain threshold
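A rough sketch of this support computation, treating each triple component as a WordNet synset and using NLTK's hypernym closure; the helper names are ours, not the authors'.

```python
# Hypothetical sketch of support counting under the hypernym relation u => v.
from nltk.corpus import wordnet as wn

def generalizations(synset):
    """All synsets v such that synset => v: the synset itself plus every
    direct or indirect hypernym."""
    return {synset} | set(synset.closure(lambda s: s.hypernyms()))

def supports(concrete, general):
    """Does the concrete triple (s', v', o') support the general triple (s, v, o)?"""
    return all(g in generalizations(c) for c, g in zip(concrete, general))

def support(general, concept_triples):
    """support(s, v, o): the number of input concept triples that support it."""
    return sum(supports(t, general) for t in concept_triples)
```

Counting support for any one triple is cheap, but there are combinatorially many candidate ⟨s, v, o⟩ at all levels of generalization, which is what the Apriori-style search on the next slide addresses.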

Identifying frequent triples
We use an algorithm inspired by Apriori
However, we have to adapt it to prevent the generation of an intractably large number of candidate triples (most of which would turn out to be infrequent)
We use the depth of concepts in the WordNet hierarchy to order the search space
Process triples in increasing order of the sum of the depths of their concepts
– Each depth-sum requires one pass through the data
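The slide only names the key ideas, so the following is an assumption about how the passes could be organized: a levelwise loop over depth-sums, with Apriori-style candidate generation from the frequent triples already found at shallower (more general) levels. generate_candidates and scan_corpus are placeholders, and supports() is the helper from the earlier sketch.

```python
# Hypothetical levelwise mining by depth-sum; not the authors' actual code.
def mine_frequent_triples(scan_corpus, generate_candidates, min_support, max_depth_sum):
    frequent = {}                                  # triple -> support count
    for depth_sum in range(max_depth_sum + 1):     # more general triples first
        # Apriori pruning: a triple can only be frequent if its (shallower)
        # generalizations are already known to be frequent.
        candidates = generate_candidates(frequent, depth_sum)
        if not candidates:
            continue
        counts = dict.fromkeys(candidates, 0)
        for triple in scan_corpus():               # one pass over the concept triples
            for cand in candidates:
                if supports(triple, cand):
                    counts[cand] += 1
        frequent.update({c: n for c, n in counts.items() if n >= min_support})
    return frequent
```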

Identifying interesting triples
Not all frequent triples are interesting
– Generalizing one or more components of a triple leads to a higher (or at least equal) support
– Thus the most general triples are also the most frequent, but they aren't interesting, e.g. ⟨entity, act, entity⟩
We are investigating heuristics to identify which triples are likely to be interesting
– Let s be a concept and s' its hypernym.
– Every input triple that supports s in its subject also supports s', but the other way around is usually not true.
– We can think of the ratio support(s) / support(s') as a “conditional probability” P(s|s').
– So we might naively expect that P(s|s') · support(s', v, o) input triples will support the triple ⟨s, v, o⟩.
– But the actual support(s, v, o) can be quite different. If it is significantly higher, we conclude that s fits well together with v and o.
– Thus, interestingness_S(s, v, o) = support(s, v, o) / (P(s|s') · support(s', v, o)).
– The analogous measures can be defined for v and o as well.
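A small worked sketch of the subject-side measure. support_subject and support_triple stand for the support counts defined earlier, and the numbers in the closing comment are invented purely for illustration.

```python
# Hypothetical helper computing interestingness_S(s, v, o); not the authors' code.
def interestingness_s(s, v, o, s_parent, support_subject, support_triple):
    """interestingness_S(s, v, o) = support(s, v, o) / (P(s|s') * support(s', v, o)),
    where s' = s_parent is a hypernym of s and P(s|s') = support(s) / support(s')."""
    p_s_given_parent = support_subject(s) / support_subject(s_parent)
    expected = p_s_given_parent * support_triple(s_parent, v, o)
    return support_triple(s, v, o) / expected

# Illustrative (made-up) numbers: if support(s) = 100, support(s') = 1000 and
# support(s', v, o) = 50, we would naively expect 0.1 * 50 = 5 supporting triples;
# an actual support(s, v, o) of 20 then gives interestingness_S = 20 / 5 = 4,
# i.e. s combines with v and o far more often than its hypernym would suggest.
```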

Identifying interesting triples
But this measure of interestingness turns out to be too sensitive to outliers and quirks in the WordNet hierarchy
Define the sv-neighborhood of a triple ⟨s, v, o⟩ as the set of all (frequent) triples with the same s and v.
– The so- and vo-neighborhoods can be defined analogously.
Possible criteria to select interesting triples now include (see the sketch below):
– A triple is interesting if it is the most interesting one in two (or even all three) of its neighborhoods (sv-, so- and vo-).
– We might also require that the neighborhoods be large enough.
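A rough sketch of the first criterion: keep a triple if it scores highest in at least two of its three neighborhoods, optionally ignoring neighborhoods that are too small. The thresholds and names here are illustrative assumptions, not the authors' settings.

```python
# Hypothetical neighborhood-based selection of interesting triples.
from collections import defaultdict

def select_interesting(scores, min_neighborhood_size=2, required_wins=2):
    """scores maps frequent triples (s, v, o) to their interestingness values."""
    neighborhoods = defaultdict(list)
    for (s, v, o), score in scores.items():
        neighborhoods[("sv", s, v)].append((score, (s, v, o)))
        neighborhoods[("so", s, o)].append((score, (s, v, o)))
        neighborhoods[("vo", v, o)].append((score, (s, v, o)))
    wins = defaultdict(int)
    for members in neighborhoods.values():
        if len(members) < min_neighborhood_size:   # neighborhood too small to trust
            continue
        best = max(members, key=lambda pair: pair[0])[1]
        wins[best] += 1
    return [t for t, w in wins.items() if w >= required_wins]
```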

Experiments: Frequent triples
Input: 15.9 million ⟨subject, predicate, object⟩ triples extracted from the Reuters (RCV1) corpus
For 11.8 million of them, we were able to associate them with WordNet concepts. These are the basis of further processing.
Frequent triple discovery:
– Found 40 million frequent triples (at various levels of generalization) in about 60 hours of CPU time
– Required 35 passes through the data (one for each depth-sum)
– In no pass did the number of candidates generated exceed the number of actually frequent triples by more than 60%

Experiments: Interesting triples
We manually evaluated the interestingness of all the frequent triples that are specializations of ⟨person, inhabit, location⟩ (there were 1321 of them)
– On a scale of 1..5, we consider 4 and 5 as being interesting
– If, instead of looking at all these triples, we select a smaller group of them on the basis of our interestingness measures, does the percentage of triples scored 4 or 5 increase?

Conclusions and future work
Frequent triples
– Our frequent-triple algorithm successfully handles large amounts of data
– Its memory footprint only minimally exceeds the amount needed to store the actual frequent triples themselves
Interesting triples
– Our measure of interestingness has some potential, but it remains to be seen what the right way to use it is
– Evaluation involving a larger set of triples is planned
Ideas for future work: covering approaches
– Suppose we fix s and v, and look at where the corresponding o's (i.e. those for which ⟨s, v, o⟩ is frequent) fall in the WordNet hypernym tree
– We want to identify nodes whose subtrees cover many of these concepts but not too many other concepts (combined with an MDL criterion)
– Alternative: treat the input concept triples as positive examples and generate random concept triples as negative examples; use this as the basis for a coverage problem similar to those used in learning association rules