Declarative Analysis of Noisy Information Networks Walaa Eldin Moustafa Galileo Namata Amol Deshpande Lise Getoor University of Maryland

Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work

Motivation

Users/objects are modeled as nodes and relationships as edges. The observed networks are noisy and incomplete: – Some users may have more than one account – Communication may contain a lot of spam – Attributes and links may be missing, and there may be multiple references to the same entity We need to extract the underlying information network.

Inference Operations Attribute Prediction – To predict values of missing attributes Link Prediction – To predict missing links Entity Resolution – To predict if two references refer to the same entity These prediction tasks can use: – Local node information – Relational information surrounding the node

Attribute Prediction Task: predict the topic of each paper (DB, NL, or unknown). Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]. [Figure: small citation network of papers, some labeled DB or NL and some unlabeled.]

Attribute Prediction Task: predict the topic of the paper. [Figure: the same citation network; the unlabeled papers P1 and P2 are assigned topics based on the labels of their neighbors.]

Link Prediction Goal: predict new links, using local similarity and relational similarity [Liben-Nowell et al., CIKM 2003]. [Figure: co-authorship network (Divesh Srivastava, Nick Koudas, Graham Cormode, and others) used to illustrate predicted links.]

Entity Resolution Goal: deduce that two references refer to the same entity. Can be based on node attributes (local), e.g. string similarity between titles or author names, but local information alone may not be enough. [Figure: two references named 'Jian Li' that cannot be disambiguated from local attributes alone.]

Entity Resolution Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]. [Figure: the two 'Jian Li' references with their respective co-author neighborhoods (William Roberts, Petre Stoica, Prabhu Babu; Amol Deshpande, Samir Khuller, Barna Saha), which distinguish them.]

Joint Inference Each task helps the others get better predictions. How to combine the tasks? – One after the other (pipelined), or interleaved? GAIA: – A Java library for applying multiple joint AP, LP, and ER learning and inference tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009] – Inference can be pipelined or interleaved.

Our Goal and Contributions Motivation: To support declarative network inference Desiderata: – User declaratively specifies the prediction features Local features Relational features – Declaratively specify tasks Attribute prediction, Link prediction, Entity resolution – Specify arbitrary interleaving or pipelining – Support for complex prediction functions Handle all that efficiently

Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work

Unifying Framework Specify the domain Compute features Make Predictions, and Compute Confidence in the Predictions Choose Which Predictions to Apply For attribute prediction, the domain is a subset of the graph nodes. For link prediction and entity resolution, the domain is a subset of pairs of nodes.

Unifying Framework Specify the domain Compute features Make Predictions, and Compute Confidence in the Predictions Choose Which Predictions to Apply Local: word frequency, income, etc. Relational: degree, clustering coeff., no. of neighbors with each attribute value, common neighbors between pairs of nodes, etc.

Unifying Framework Specify the domain Compute features Make Predictions, and Compute Confidence in the Predictions Choose Which Predictions to Apply Attribute prediction: the missing attribute Link prediction: add link or not? Entity resolution: merge two nodes or not?

Unifying Framework Specify the Domain Compute Features Make Predictions, and Compute Confidence in the Predictions Choose Which Predictions to Apply After predictions are made, the graph changes: Attribute prediction changes local attributes. Link prediction changes the graph links. Entity resolution changes both local attributes and graph links.
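The four steps can be read as a generic loop. The sketch below is a Python illustration, not the system's actual code; all helper functions (compute_domain, compute_features, predict, confidence, apply_prediction) are hypothetical placeholders passed in by the caller.

def run_inference(graph, compute_domain, compute_features, predict, confidence,
                  apply_prediction, threshold=0.95, max_iters=10):
    # Generic four-step loop: domain -> features -> predict/confidence -> apply.
    for _ in range(max_iters):
        applied = 0
        for element in compute_domain(graph):            # nodes (AP) or node pairs (LP/ER)
            feats = compute_features(graph, element)     # local + relational features
            label, conf = predict(feats), confidence(feats)
            if label is not None and conf > threshold:
                apply_prediction(graph, element, label)  # updates attributes and/or links
                applied += 1
        if applied == 0:                                 # stop when no confident predictions remain
            break
    return graph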

Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work

Datalog Use Datalog to express: – Domains – Local and relational features Extend Datalog with operational semantics (vs. fixpoint semantics) to express: – Predictions (in the form of updates) – Iteration

Specifying Features
Degree:
  Degree(X, COUNT) :- Edge(X, Y)
Number of neighbors with attribute 'A':
  NumNeighbors(X, COUNT) :- Edge(X, Y), Node(Y, Att='A')
Clustering coefficient:
  NeighborCluster(X, COUNT) :- Edge(X, Y), Edge(X, Z), Edge(Y, Z)
  ClusteringCoeff(X, C) :- NeighborCluster(X, N), Degree(X, D), C = 2*N/(D*(D-1))
Jaccard coefficient:
  IntersectionCount(X, Y, COUNT) :- Edge(X, Z), Edge(Y, Z)
  UnionCount(X, Y, D) :- Degree(X, D1), Degree(Y, D2), IntersectionCount(X, Y, D3), D = D1+D2-D3
  Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J = N/D
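For intuition only, here is a small Python sketch of what these rules compute, assuming a plain adjacency-set representation (dict of node -> set of neighbors) rather than the system's storage:

def degree(adj, x):
    return len(adj[x])

def clustering_coeff(adj, x):
    d = degree(adj, x)
    if d < 2:
        return 0.0
    # count each edge among x's neighbors once (y < z avoids double counting)
    links = sum(1 for y in adj[x] for z in adj[x] if y < z and z in adj[y])
    return 2.0 * links / (d * (d - 1))

def jaccard(adj, x, y):
    inter = len(adj[x] & adj[y])                       # IntersectionCount
    union = degree(adj, x) + degree(adj, y) - inter    # UnionCount
    return inter / union if union else 0.0

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(clustering_coeff(adj, 3))   # 0.333...
print(jaccard(adj, 1, 2))         # 0.333...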

Specifying Domains
Domains restrict the space of computation for the prediction elements. Without a domain, the space for this feature is |V|^2:
  Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S = f(V1, V2)
Using this domain, the space becomes |E|:
  DOMAIN D(X, Y) :- Edge(X, Y)
Other DOMAIN predicates: equality, locality-sensitive hashing, string similarity joins, and edge traversal.

Feature Vector
Features of prediction elements are combined in a single predicate to create the feature vector:
DOMAIN D(X, Y) :- …
{
  P1(X, Y, F1) :- …
  …
  Pn(X, Y, Fn) :- …
  Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
}

Update Operation
DEFINE Merge(X, Y)
{
  INSERT Edge(X, Z) :- Edge(Y, Z)
  DELETE Edge(Y, Z)
  UPDATE Node(X, A=ANew) :- Node(X, A=AX), Node(Y, A=AY), ANew = (AX+AY)/2
  UPDATE Node(X, B=BNew) :- Node(X, B=BX), Node(Y, B=BY), BNew = max(BX, BY)
  DELETE Node(Y)
}
Merge(X, Y) :- Features(X, Y, F1, …, Fn), predict-ER(F1, …, Fn) = true, confidence-ER(F1, …, Fn) > 0.95
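A minimal Python sketch of what such a merge does to an in-memory graph (the adjacency-dict/attribute-dict layout and the attribute names A and B are illustrative, not the system's data model):

def merge(adj, attrs, x, y):
    # INSERT Edge(X, Z) :- Edge(Y, Z); DELETE Edge(Y, Z)
    for z in adj.pop(y, set()):
        adj[z].discard(y)
        if z != x:
            adj[x].add(z)
            adj[z].add(x)
    # UPDATE Node(X, ...): combine the attribute values of X and Y
    attrs[x]['A'] = (attrs[x]['A'] + attrs[y]['A']) / 2
    attrs[x]['B'] = max(attrs[x]['B'], attrs[y]['B'])
    # DELETE Node(Y)
    del attrs[y]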

Prediction and Confidence Functions The prediction and confidence functions are user-defined functions. They can be based on logistic regression, a Bayes classifier, or any other classification algorithm. The confidence is the class-membership value: – In logistic regression, the confidence can be the value of the logistic function – In a Bayes classifier, the confidence can be the posterior probability value
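For example, a logistic-regression-style predict/confidence pair might look like the following Python sketch (the weights are illustrative placeholders, not values learned by the system described here):

import math

WEIGHTS = [0.8, -0.3, 1.2]    # one weight per feature (hypothetical values)
BIAS = -0.5

def confidence_er(features):
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))      # value of the logistic function

def predict_er(features):
    return confidence_er(features) >= 0.5  # merge the pair or not

print(predict_er([1.0, 0.2, 0.7]), confidence_er([1.0, 0.2, 0.7]))  # True 0.746...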

Iteration
Iteration is supported by the ITERATE construct. It takes the number of iterations as a parameter, or * to iterate until no more predictions are made.
ITERATE (*)
{
  Merge(X, Y) :- Features(X, Y, F1, …, Fn), predict-ER(F1, …, Fn) = true, confidence-ER(F1, …, Fn) IN TOP 10%
}

Pipelining
DOMAIN ER(X, Y) :- …
{
  ER1(X, Y, F1) :- …
  ER2(X, Y, F2) :- …
  Features-ER(X, Y, F1, F2) :- …
}
DOMAIN LP(X, Y) :- …
{
  LP1(X, Y, F1) :- …
  LP2(X, Y, F2) :- …
  Features-LP(X, Y, F1, F2) :- …
}
ITERATE(*)
{
  INSERT Edge(X, Y) :- Features-LP(X, Y, F1, F2), predict-LP(X, Y, F1, F2), confidence-LP(X, Y, F1, F2) IN TOP 10%
}
ITERATE(*)
{
  Merge(X, Y) :- Features-ER(X, Y, F1, F2), predict-ER(X, Y, F1, F2), confidence-ER(X, Y, F1, F2) IN TOP 10%
}

Interleaving
DOMAIN ER(X, Y) :- …
{
  ER1(X, Y, F1) :- …
  ER2(X, Y, F2) :- …
  Features-ER(X, Y, F1, F2) :- …
}
DOMAIN LP(X, Y) :- …
{
  LP1(X, Y, F1) :- …
  LP2(X, Y, F2) :- …
  Features-LP(X, Y, F1, F2) :- …
}
ITERATE(*)
{
  INSERT Edge(X, Y) :- Features-LP(X, Y, F1, F2), predict-LP(X, Y, F1, F2), confidence-LP(X, Y, F1, F2) IN TOP 10%
  Merge(X, Y) :- Features-ER(X, Y, F1, F2), predict-ER(X, Y, F1, F2), confidence-ER(X, Y, F1, F2) IN TOP 10%
}

Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work

Implementation Prototype built in Java on top of Berkeley DB. Implemented a query parser, plan generator, and query evaluation engine. Incremental maintenance: – Aggregate/non-aggregate incremental maintenance – DOMAIN maintenance

Incremental Maintenance
Predicates in the program correspond to materialized tables (key/value maps). Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges:
– Insertions: | Record | +1 |
– Deletions: | Record | -1 |
– Updates: a deletion followed by an insertion
Aggregate maintenance is performed by aggregating the change table and then refreshing the old table.
DOMAIN maintenance: a program
  DOMAIN L(X) :- Subgoals of L
  {
    P1(X, Y) :- Subgoals of P1
  }
is rewritten as
  L(X) :- Subgoals of L
  P1'(X) :- L(X), Subgoals of P1
  P1(X) :- L(X) >> Subgoals of P1
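As a rough Python sketch of aggregate maintenance with such a change table (the data layout is illustrative, not the system's key/value maps):

from collections import defaultdict

def refresh_degree(degree_table, delta_edges):
    # degree_table: dict node -> count; delta_edges: list of ((x, y), +1 or -1)
    # 1. Aggregate the change table: net degree change per node
    delta = defaultdict(int)
    for (x, _y), sign in delta_edges:
        delta[x] += sign
    # 2. Refresh the old table with the aggregated deltas
    for node, change in delta.items():
        new = degree_table.get(node, 0) + change
        if new:
            degree_table[node] = new
        else:
            degree_table.pop(node, None)

degrees = {1: 2, 2: 1}
refresh_degree(degrees, [((1, 3), +1), ((2, 5), -1)])
print(degrees)   # {1: 3}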

Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work

Synthetic Experiments Synthetic graphs generated using forest fire and preferential attachment generation models. Three tasks: – Attribute Prediction, Link Prediction, and Entity Resolution Two approaches: – Recomputing features after every iteration – Incremental maintenance Varied parameters: – Graph size – Graph density – Confidence threshold (update size)

Changing Graph Size Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges

Comparison with Derby Compared the evaluation of 4 features: degree, clustering coefficient, common neighbors and Jaccard.

Real-world Experiment Real-world PubMed graph – Set of publications from the medical domain, with their abstracts and citations – 50,634 publications, 115,323 citation edges Task: Attribute prediction – Predict whether the paper is categorized as Cognition, Learning, Perception, or Thinking Choose the top 10% of predictions after each iteration, for 10 iterations. Incremental: 28 minutes. Recompute: 42 minutes.

Program
DOMAIN Uncommitted(X) :- Node(X, Committed='no')
{
  ThinkingNeighbors(X, Count) :- Edge(X, Y), Node(Y, Label='Thinking')
  PerceptionNeighbors(X, Count) :- Edge(X, Y), Node(Y, Label='Perception')
  CognitionNeighbors(X, Count) :- Edge(X, Y), Node(Y, Label='Cognition')
  LearningNeighbors(X, Count) :- Edge(X, Y), Node(Y, Label='Learning')
  Features-AP(X, A, B, C, D, Abstract) :- ThinkingNeighbors(X, A), PerceptionNeighbors(X, B), CognitionNeighbors(X, C), LearningNeighbors(X, D), Node(X, Abstract, _, _)
}
ITERATE(10)
{
  UPDATE Node(X, _, P, 'yes') :- Features-AP(X, A, B, C, D, Text), P = predict-AP(X, A, B, C, D, Text), confidence-AP(X, A, B, C, D, Text) IN TOP 10%
}

Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work

Related Work
Dedupalog [Arasu et al., ICDE 2009]: – Datalog-based entity resolution – The user defines hard and soft rules for deduplication – The system satisfies the hard rules and minimizes violations of the soft rules when deduplicating references
Swoosh [Benjelloun et al., VLDBJ 2008]: – Generic entity resolution – A match function for pairs of nodes (based on a set of features) – A merge function determines which pairs should be merged

Conclusions and Ongoing Work Conclusions: – We built a declarative system to specify graph inference operations – We implemented the system on top of Berkeley DB and implemented incremental maintenance techniques Future work: – Direct computation of top-k predictions – Multi-query evaluation (especially on graphs) – Employing a graph DB engine (e.g. Neo4j) – Support recursive queries and recursive view maintenance

References
[Sen et al., AI Magazine 2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad: Collective Classification in Network Data. AI Magazine 29(3), 2008.
[Liben-Nowell et al., CIKM 2003] David Liben-Nowell, Jon M. Kleinberg: The Link Prediction Problem for Social Networks. CIKM 2003.
[Bhattacharya et al., TKDD 2007] I. Bhattacharya and L. Getoor: Collective Entity Resolution in Relational Data. ACM TKDD 1:1-36, 2007.
[Namata et al., MLG 2009] G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
[Namata et al., KDUD 2009] G. Namata and L. Getoor: Identifying Graphs from Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.
[Arasu et al., ICDE 2009] A. Arasu, C. Re, and D. Suciu: Large-Scale Deduplication with Constraints using Dedupalog. ICDE, 2009.
[Benjelloun et al., VLDBJ 2008] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom: Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, 2008.