Statistical Relational Learning for Link Prediction
Alexandrin Popescul and Lyle H. Ungar
Presented by Ron Bjarnason, 11 November 2003



Link Prediction
Link prediction is an important problem arising in many domains:
  – Web pages
  – Computers
  – Scientific publications
  – Organizations
  – People
Predicting the presence of links or connections in a domain is both important and difficult to do well.

Characteristics of Link Prediction Domains
Their nature is inherently multi-relational
  – This makes the standard “flat” file domain representation inadequate
Data is often noisy or only partially observed
  – e.g. an article may be cited for any number of reasons, and those reasons are not fully observed

Typical Learning Approaches
Assume a one-table “flat” domain representation
Feature creation is decoupled from feature selection (and is often performed manually)
Relevant features may not be readily apparent to a human observer

The “Full Join” Approach
Perform a full join on the entire database and statistically analyze the entries
  – Both impractical and incorrect
      Size is prohibitive
      The notion of an object is lost (each object is stored across multiple rows)
      Entries will be atomic attribute values, rather than results from a complex search
      Negates the option to introduce intelligent search heuristics
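To make the "notion of an object is lost" point concrete, here is a minimal sketch using an in-memory SQLite database. The relation names Author and Citation appear later in the slides; the column names and the toy rows are assumptions made up for illustration.

```python
import sqlite3

def full_join_rows():
    """Join a toy Author and Citation table and return the joined rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: column names are illustrative, not from the slides.
    conn.executescript("""
        CREATE TABLE Author (doc TEXT, person TEXT);
        CREATE TABLE Citation (from_doc TEXT, to_doc TEXT);
        INSERT INTO Author VALUES ('d1', 'alice'), ('d1', 'bob');
        INSERT INTO Citation VALUES ('d1', 'x'), ('d1', 'y'), ('d1', 'z');
    """)
    rows = conn.execute(
        "SELECT a.doc, a.person, c.to_doc "
        "FROM Author a JOIN Citation c ON a.doc = c.from_doc"
    ).fetchall()
    conn.close()
    return rows

rows = full_join_rows()
# One document with 2 authors and 3 citations becomes 2 * 3 joined rows.
print(len(rows))
```

A single document is smeared over six rows here, and the count grows multiplicatively with every additional relation that is joined in, which is the size problem the slide mentions.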

The Relational Method
Integrates standard statistical modeling (logistic regression) with a process for systematically generating features from relational data
Feature generation is formulated as a search in the space of relational database queries
The bias of the search space can be controlled by specifying valid query types
  – Aggregations or statistical operations
  – Groupings
  – Richer join conditions
  – Arg-max based queries
Allows for the discovery of complex, interesting relationships
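As a rough illustration of "features as database queries", the sketch below evaluates one aggregated query for each candidate document pair: the number of documents that both members of the pair cite. The Citation relation is named in the slides; its column names, the toy data, and the choice of this particular query are assumptions, not the authors' feature set.

```python
import sqlite3

def build_toy_db():
    """Create an in-memory Citation relation with hypothetical columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Citation (from_doc TEXT, to_doc TEXT)")
    conn.executemany(
        "INSERT INTO Citation VALUES (?, ?)",
        [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x")],
    )
    return conn

def shared_citation_count(conn, d1, d2):
    """Feature: COUNT aggregate over the documents cited by both d1 and d2."""
    row = conn.execute(
        "SELECT COUNT(*) FROM Citation c1 "
        "JOIN Citation c2 ON c1.to_doc = c2.to_doc "
        "WHERE c1.from_doc = ? AND c2.from_doc = ?",
        (d1, d2),
    ).fetchone()
    return row[0]

conn = build_toy_db()
pairs = [("a", "b"), ("a", "c")]
features = {pair: shared_citation_count(conn, *pair) for pair in pairs}
print(features)
```

Each such query yields one scalar column for the statistical model, so searching over queries amounts to searching over candidate features.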

Link Prediction in the Citeseer Domain
Can be used as a citation recommendation service
  – The user would provide an abstract, author names, and possibly a partial reference list
Citeseer provides a rich set of relational data
  – Texts of titles
  – Abstracts and documents
  – Citation information
  – Author names and affiliations
  – Conference or journal names

Methodology
Couple the two main processes
  – Generation of feature candidates from relational data
  – Their selection with statistical model selection criteria

Relational Feature Generation
The search formulation is based on the concept of refinement graphs
Start with the most general clauses and progress by refining them into more specialized clauses

Relational Feature Generation – Refinement Graphs
Directed acyclic graphs specifying the search space
Constrained by specifying legal clauses
  – Negation and recursion are disallowed
Structured by a partial ordering of clauses
A search node is expanded (refined) to produce its most general specializations
ILP systems using refinement graph search usually apply two refinement operators
  – Add a predicate to a clause
  – A single variable substitution
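The two refinement operators can be sketched as follows. This is a simplified illustration, not an ILP system: a clause is represented as a tuple of (predicate, arguments) literals, and the predicate table, variable naming scheme, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical predicate signatures: name -> arity.
PREDICATES = {"Citation": 2, "Author": 2}

def variables(clause):
    """Return the distinct variables of a clause in order of first use."""
    seen = []
    for _, args in clause:
        for arg in args:
            if arg not in seen:
                seen.append(arg)
    return seen

def fresh_variables(clause, count):
    """Generate `count` variable names that do not occur in the clause."""
    used = [int(v[1:]) for v in variables(clause)]
    start = max(used, default=-1) + 1
    return tuple(f"V{start + i}" for i in range(count))

def add_predicate(clause):
    """Operator 1: extend the clause by one literal over fresh variables."""
    for name, arity in sorted(PREDICATES.items()):
        yield clause + ((name, fresh_variables(clause, arity)),)

def substitute_variable(clause):
    """Operator 2: unify one pair of variables (a single substitution)."""
    for keep, drop in combinations(variables(clause), 2):
        yield tuple(
            (name, tuple(keep if arg == drop else arg for arg in args))
            for name, args in clause
        )

def refine(clause):
    """Expand a search node into its one-step specializations."""
    return list(add_predicate(clause)) + list(substitute_variable(clause))

start = (("Citation", ("V0", "V1")),)
for child in refine(start):
    print(child)
```

Adding a literal over entirely fresh variables is the weakest possible commitment; later substitutions tie those variables to existing ones, which is how the search moves from general to specific clauses.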

Relational Feature Generation – Aggregates
Query results are aggregated to produce scalar numeric values to be used in statistical learning
Any statistical aggregate can be valid, but some are expected to be more useful than others
  – Count
  – Average
  – Max
  – Min
  – Mode
  – Empty
Aggregations are considered for inclusion at each node, but are not factored into further search
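A minimal sketch of the aggregates listed above, applied to the numeric result set of one query. The function name is hypothetical, and returning 0.0 for the value aggregates of an empty result is an assumption chosen so that every feature stays numeric; the slides do not say how empty results are handled.

```python
from collections import Counter

def aggregate_features(values):
    """Turn a list of numeric query results into scalar features."""
    n = len(values)
    if n == 0:
        # Assumption: value aggregates default to 0.0 when nothing matched.
        return {"count": 0, "empty": 1, "average": 0.0,
                "max": 0.0, "min": 0.0, "mode": 0.0}
    return {
        "count": n,
        "empty": 0,
        "average": sum(values) / n,
        "max": max(values),
        "min": min(values),
        # Most frequent value; ties go to the value seen first.
        "mode": Counter(values).most_common(1)[0][0],
    }

# Example: publication years of the documents returned by some query.
print(aggregate_features([1999, 2001, 2001, 2003]))
print(aggregate_features([]))
```

The "empty" indicator lets the model distinguish a query that matched nothing from one whose aggregate happens to be zero.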

Relational Feature Selection
Logistic regression is used for binary classification problems
Regression coefficients are learned to maximize the likelihood function
Stepwise model selection and the Bayesian Information Criterion (BIC) are used to avoid overfitting
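The role of BIC in stepwise selection can be sketched as below. BIC is computed as -2 ln L + k ln n, where L is the maximized likelihood, k the number of fitted parameters, and n the number of observations. This is a generic greedy forward selection, not necessarily the authors' exact procedure, and `fit_log_likelihood` is a hypothetical callable standing in for "fit a logistic regression on these features and return its maximized log-likelihood".

```python
import math

def bic(log_likelihood, num_params, num_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(num_obs)

def forward_select(candidates, fit_log_likelihood, num_obs):
    """Greedily add the feature that lowers BIC most; stop when none does."""
    selected = []
    # Intercept-only model has one parameter.
    best = bic(fit_log_likelihood(selected), 1, num_obs)
    improved = True
    while improved:
        improved = False
        best_feature = None
        for feature in candidates:
            if feature in selected:
                continue
            trial = selected + [feature]
            # One coefficient per feature plus the intercept.
            score = bic(fit_log_likelihood(trial), len(trial) + 1, num_obs)
            if score < best:
                best, best_feature, improved = score, feature, True
        if improved:
            selected.append(best_feature)
    return selected, best
```

Because each extra coefficient costs ln n, a feature is kept only when it raises the log-likelihood by more than half that amount, which is what guards against overfitting as many candidate features are generated.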

Tasks and Data – IID Violation
The relational structure violates the assumption that observations are independent and identically distributed (IID)
This can be remedied by choosing the right features
When the right features are used, the observations are independent given the features

Two Prediction Tasks
1. The identity of all objects is known. Some link structure is known. Predict unobserved links.
2. New objects arrive. Predict their links.
  – What do we know about the objects?
      Some of their links
      Some of their attributes
This paper presents results for task 1

The Citeseer Environment
271,343 documents
1,092,200 citations
Five data sets defined
  – Four data sets consist of links among documents containing a certain query phrase (e.g. “artificial intelligence”)
  – The fifth data set includes all documents

Learning Methodology
Populate three relations: Citation, Author and PublishedIn
Sample 2,500 citations each of
  – Positive training examples (from available links)
  – Negative training examples (absence of a link)
  – Positive test examples
  – Negative test examples
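The sampling step can be sketched as follows, under the assumption that positives are drawn from observed links and negatives from ordered document pairs with no observed link. The function name, the seed handling, and the rejection-sampling strategy are illustrative choices, not details taken from the slides.

```python
import random

def sample_link_examples(documents, links, n_per_class, seed=0):
    """Draw disjoint positive/negative train and test sets of equal size."""
    rng = random.Random(seed)
    link_set = set(links)
    # Positives: observed links, split between training and test.
    positives = rng.sample(sorted(link_set), 2 * n_per_class)
    # Negatives: distinct document pairs that are not observed links.
    negatives = set()
    while len(negatives) < 2 * n_per_class:
        pair = (rng.choice(documents), rng.choice(documents))
        if pair[0] != pair[1] and pair not in link_set:
            negatives.add(pair)
    negatives = sorted(negatives)
    rng.shuffle(negatives)
    return {
        "train_pos": positives[:n_per_class],
        "test_pos": positives[n_per_class:],
        "train_neg": negatives[:n_per_class],
        "test_neg": negatives[n_per_class:],
    }

docs = [f"d{i}" for i in range(10)]
links = [(docs[i], docs[i + 1]) for i in range(9)]
sets = sample_link_examples(docs, links, n_per_class=2)
print({name: len(examples) for name, examples in sets.items()})
```

With the slide's figure of 2,500 examples per set, `n_per_class` would be 2500; equal set sizes give the balanced priors under which the accuracies are reported.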

Learning Methodology (continued)
Remove the test-set citations from the data (but no other relevant information)
Remove the training-set citations from the background information (so the answers are not contained in it)
Perform learning
  – Using citations only
  – Using all relevant information (citations, authors and venue)

Results: Training and Test Set Accuracies (balanced priors)
Table columns: Dataset; BK: Citation (Train, Test); BK: All (Train, Test)
Table rows: “artificial intelligence”; “data mining”; “information retrieval”; “machine learning”; Entire collection

The End