Chapter 2 Modeling and Finding Abnormal Nodes

How to define abnormal nodes?
One plausible answer:
– A node is abnormal if no nodes, or very few, in the network are similar to it under some similarity measure.
Other possibilities: to be discussed.

How do we measure the similarity of nodes? (simple methods)
– The type of the nodes, e.g. two nodes are similar if they are of the same or a similar type.
– The type of directly connected nodes, e.g. two nodes are similar if they are connected to similar types of nodes.
– The type of indirectly connected nodes via paths of a certain length, e.g. two nodes can be similar if the sets of nodes within three steps of them are similar.
– The type of their links or edges, e.g. two nodes are similar if they have the same types of links.
– The type of indirect links via paths of a certain length, e.g. two nodes can be similar if the sets of links within three steps of them are similar.
– The network statistics of the nodes, e.g. two nodes are similar if they have the same degree or connect to the same number of nodes.
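Two of the simple measures above can be sketched in a few lines. This is only an illustration: the toy network, the dictionary layout, and the use of Jaccard overlap to compare type sets are assumptions made for the sketch, not something the slides specify.

```python
# Toy network (hypothetical): node -> type, and node -> set of neighbours.
node_type = {"A1": "author", "A2": "author", "P1": "paper", "P2": "paper", "O1": "org"}
neighbours = {
    "A1": {"P1", "O1"},
    "A2": {"P2"},
    "P1": {"A1"},
    "P2": {"A2"},
    "O1": {"A1"},
}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def neighbour_type_similarity(u, v):
    """Compare the sets of types of the directly connected nodes of u and v."""
    types_u = {node_type[n] for n in neighbours[u]}
    types_v = {node_type[n] for n in neighbours[v]}
    return jaccard(types_u, types_v)

def same_degree(u, v):
    """Network-statistics measure: do the two nodes have the same degree?"""
    return len(neighbours[u]) == len(neighbours[v])
```

Here A1 touches a paper and an organization while A2 touches only a paper, so their neighbour-type similarity is one half and their degrees differ.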

The proposals above might not capture the deeper meaning of the nodes. Our real interest is to find nodes that contain abnormal semantics or play different roles in the network.

Syntactic semantics
The best way for us to model the semantics of the nodes is to adopt the concept of syntactic semantics. The semantics of a node is not determined by its label or name; instead, it can be represented by the role the node plays in the network or by its surrounding network structure.
We propose to judge whether two nodes are similar by checking whether they play similar roles in the network. To determine these roles, we analyze the different combinations of surrounding nodes and links together with their network statistics.

3 stages of modeling and finding abnormal nodes
1. Structure modeling or feature selection.
– The system automatically selects a set of features to represent the surrounding network structure of nodes.
2. Dependency computation or feature value generation.
– For this stage we design a set of different models to compute the dependency between the features and the nodes in the multi-relational network (MRN).
3. The system tries to find the abnormal nodes by looking for those with abnormal semantic profiles.

Feature Selection Stage
Example (by examining Figure 1.2): we can conclude several things about each node in Figure 1.2.
– A1 published two journal papers (P1, P3), and one of them cites the other. A1 belongs to organization O1, has a colleague A3, and co-authored one paper with A4.
– A2 published two journal papers (P4, P5), and one of them cites the other. A2 belongs to organization O2, has a colleague A3, and co-authored one paper with A4.
– A3 published one paper, P2 (no citation). A3 belongs to two organizations, O1 and O2, and has two colleagues, A1 and A2.
– A4 published two journal papers (P3, P4); one of them cites another paper and the other is cited by another paper. A4 co-authored with two persons (A1 and A2).
Based on the above description, it is not hard to recognize that A3 has the most abnormal semantics. Humans can successfully identify abnormal nodes if the roles of the nodes can somehow be summarized.

How to model the semantics or roles of the nodes
We need a systematic method to model the semantics of the nodes so that they can be compared and contrasted automatically. Two methods:
– Individual paths as features
– Path types (better)

Individual paths as features
We start by systematically listing the one- and two-step paths of the four author nodes (a relation marked ^-1 is traversed in the inverse direction):
– A1:
  A1 - writes - P1 - published_in - J1 (A1 writes P1, which is published in J1)
  A1 - writes - P3 - published_in - J1 (A1 writes P3, which is published in J1)
  A1 - writes - P1 - cites^-1 - P3 (A1 writes P1, which is cited by P3)
  A1 - writes - P3 - cites - P1 (A1 writes P3, which cites P1)
  A1 - writes - P3 - writes^-1 - A4 (A1 writes P3, which is written by A4)
  A1 - belongs - O1 - belongs^-1 - A3 (A1 belongs to O1, which is the org. of A3)
– A2:
  A2 - writes - P4 - published_in - J1 (A2 writes P4, which is published in J1)
  A2 - writes - P5 - published_in - J2 (A2 writes P5, which is published in J2)
  A2 - writes - P4 - cites^-1 - P5 (A2 writes P4, which is cited by P5)
  A2 - writes - P5 - cites - P4 (A2 writes P5, which cites P4)
  A2 - writes - P4 - writes^-1 - A4 (A2 writes P4, which is written by A4)
  A2 - belongs - O2 - belongs^-1 - A3 (A2 belongs to O2, which is the org. of A3)
– A3:
  A3 - writes - P2 (A3 writes P2)
  A3 - belongs - O1 - belongs^-1 - A1 (A3 belongs to O1, which is the org. of A1)
  A3 - belongs - O2 - belongs^-1 - A2 (A3 belongs to O2, which is the org. of A2)
– A4:
  A4 - writes - P3 - published_in - J1 (A4 writes P3, which is published in J1)
  A4 - writes - P4 - published_in - J1 (A4 writes P4, which is published in J1)
  A4 - writes - P3 - cites - P1 (A4 writes P3, which cites P1)
  A4 - writes - P4 - cites^-1 - P5 (A4 writes P4, which is cited by P5)
  A4 - writes - P3 - writes^-1 - A1 (A4 writes P3, which is written by A1)
  A4 - writes - P4 - writes^-1 - A2 (A4 writes P4, which is written by A2)
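The listing can be reproduced mechanically. The sketch below is a rough illustration, assuming the network contains only the links that appear in the listed paths (Figure 1.2 itself may hold more); the names `EDGES`, `build_adjacency` and `paths_from` are hypothetical, and the suffix `^-1` is this sketch's way of writing a link traversed in the inverse direction. It returns every one- and two-step path, whereas the listing appears to show two-step paths plus one-step paths that cannot be extended.

```python
from collections import defaultdict

# Links read off the listed paths, as (source, relation, target).
EDGES = [
    ("A1", "writes", "P1"), ("A1", "writes", "P3"),
    ("A2", "writes", "P4"), ("A2", "writes", "P5"),
    ("A3", "writes", "P2"),
    ("A4", "writes", "P3"), ("A4", "writes", "P4"),
    ("P1", "published_in", "J1"), ("P3", "published_in", "J1"),
    ("P4", "published_in", "J1"), ("P5", "published_in", "J2"),
    ("P3", "cites", "P1"), ("P5", "cites", "P4"),
    ("A1", "belongs", "O1"), ("A3", "belongs", "O1"),
    ("A2", "belongs", "O2"), ("A3", "belongs", "O2"),
]

def build_adjacency(edges):
    """Index each link in both directions; the reverse gets a '^-1' suffix."""
    adj = defaultdict(list)
    for src, rel, dst in edges:
        adj[src].append((rel, dst))
        adj[dst].append((rel + "^-1", src))
    return adj

def paths_from(start, adj, max_steps=2):
    """All paths of 1..max_steps links from `start` that never revisit a node.

    A path is a flat list [node, relation, node, relation, node, ...].
    """
    results = []

    def extend(path, visited):
        for rel, nxt in adj[path[-1]]:
            if nxt in visited:
                continue
            new_path = path + [rel, nxt]
            results.append(new_path)
            if (len(new_path) - 1) // 2 < max_steps:
                extend(new_path, visited | {nxt})

    extend([start], {start})
    return results

ADJ = build_adjacency(EDGES)
two_step_a1 = [p for p in paths_from("A1", ADJ) if len(p) == 5]
```

Under these assumptions `two_step_a1` holds the six two-step paths listed for A1.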

Individual paths as features (cont.)
Each path in a network can be translated into standard logical notation by representing nodes as constants and links as binary predicates. These predicates carry meaning and can be translated into natural language.
– For example: A1 - writes - P3 - cites - P1 => writes(A1,P3) ^ cites(P3,P1)
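The translation is mechanical enough to sketch. The function name `path_to_logic`, the flat-list path layout, and the `^-1` suffix for an inverse link are assumptions of this sketch; an inverse link is rendered by swapping the predicate's arguments.

```python
def path_to_logic(path):
    """Translate [n0, r1, n1, r2, n2, ...] into a conjunction of binary predicates.

    A relation written with the suffix '^-1' is an inverse link, so its
    arguments are swapped: x - r^-1 - y becomes r(y,x).
    """
    literals = []
    for i in range(0, len(path) - 2, 2):
        src, rel, dst = path[i], path[i + 1], path[i + 2]
        if rel.endswith("^-1"):
            rel, src, dst = rel[:-3], dst, src
        literals.append(f"{rel}({src},{dst})")
    return " ^ ".join(literals)

example = path_to_logic(["A1", "writes", "P3", "cites", "P1"])
```

For the path from the slide, `example` is the string `writes(A1,P3) ^ cites(P3,P1)`.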

Individual paths as features (cont.)
Treat all paths in the network as binary features. In a network with k different paths, each node can then be represented by a k-dimensional binary feature vector:
– Assign true to the paths the given node participates in.
– Assign false to the ones it does not.

Path            A1  P3  P1
writes(A1,P3)   T   T   F
cites(P3,P1)    F   T   T
writes(A1,P1)   T   F   T
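A minimal sketch of how such binary vectors could be built, using the three paths and three nodes from the table. The helper names `participants` and `binary_profile` are hypothetical, and paths are assumed to be written as predicate strings.

```python
import re

PATHS = ["writes(A1,P3)", "cites(P3,P1)", "writes(A1,P1)"]
NODES = ["A1", "P3", "P1"]

def participants(path):
    """Return the set of constants that appear as predicate arguments in a path."""
    nodes = set()
    for args in re.findall(r"\(([^)]*)\)", path):
        nodes.update(a.strip() for a in args.split(","))
    return nodes

def binary_profile(node, paths):
    """One boolean per path: True if the node participates in that path."""
    return [node in participants(p) for p in paths]

profiles = {n: binary_profile(n, PATHS) for n in NODES}
```

Each entry of `profiles` corresponds to one column of the table, read top to bottom.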

Individual paths as features (cont.)
Two problems arise when all paths are treated as binary features:
– Overfitting: since each path is unique, the only nodes sharing a particular path feature are those participating in that path, which makes these profiles useless for comparing nodes inside the path with nodes outside it. For example, cites(P2,P1) ^ published_in(P1,J1) and cites(P2,P1) ^ published_in(P1,J2) count as two distinct features.
– Time and space complexity: a large semantic graph can easily contain millions of paths, and computation and storage in such a high-dimensional space would be costly.

Path Types
Define equivalence classes between different paths.
– Whether two individual paths are considered to be of the same type depends on which of several similarity measures we choose.
– View a set of paths as equivalent if they use the same order of relations (relation-only constraint):
  cites(P2,P1) ^ published_in(P1,J1)
  cites(P2,P1) ^ published_in(P1,J2)
  cites(P2,P3) ^ published_in(P3,J1)
– View a set of paths as equivalent if they go through the same nodes:
  friends(Person1,Person2) ^ direct(Person2,Movie1)
  colleagues(Person1,Person2) ^ act(Person2,Movie1)
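The relation-only constraint can be sketched as a grouping step: keep only the ordered relation names of each path and use that tuple as the path type. The function names and the string layout of the paths are assumptions of this sketch.

```python
import re
from collections import defaultdict

PATHS = [
    "cites(P2,P1) ^ published_in(P1,J1)",
    "cites(P2,P1) ^ published_in(P1,J2)",
    "cites(P2,P3) ^ published_in(P3,J1)",
    "friends(Person1,Person2) ^ direct(Person2,Movie1)",
]

def relation_only_type(path):
    """Ordered tuple of relation names in a path written as 'r1(a,b) ^ r2(b,c)'."""
    return tuple(re.findall(r"(\w+)\(", path))

def group_by_type(paths):
    """Map each relation-only path type to the individual paths of that type."""
    groups = defaultdict(list)
    for p in paths:
        groups[relation_only_type(p)].append(p)
    return dict(groups)

groups = group_by_type(PATHS)
```

The three citation paths fall into the single type `("cites", "published_in")`, while the movie path forms a type of its own.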

Path Types (cont.)
How do we generate a meaningful and representative set of path types?
– Variable relaxation:
  cites(P2,P1) ^ published_in(P1,J1) => cites(P2,P1) ^ published_in(P1,?X) => “paper P2 cites paper P1 that is published in some journal”
  cites(P2,P1) ^ ?Y(P1,?X) => “paper P2 cites paper P1, which is linked by some relation to some node”
Path types are more useful than individual paths as features for comparing or contrasting different instances or nodes.
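Variable relaxation can be sketched as replacing chosen slots of a path with variables. The representation below (a path as a list of `(relation, arg1, arg2)` tuples, and a `positions` map that says which slot becomes which variable) is an assumption made for illustration only.

```python
def relax(path, positions):
    """Replace chosen slots of a path with variables.

    `path` is a list of (relation, arg1, arg2) literals. `positions` maps a
    (literal_index, slot) pair to a variable name, where slot 0 is the
    relation, 1 the first argument and 2 the second argument.
    """
    return [
        tuple(positions.get((i, slot), value) for slot, value in enumerate(literal))
        for i, literal in enumerate(path)
    ]

def show(path):
    """Render a path in the predicate notation used on the slides."""
    return " ^ ".join(f"{rel}({a},{b})" for rel, a, b in path)

path = [("cites", "P2", "P1"), ("published_in", "P1", "J1")]
journal_relaxed = show(relax(path, {(1, 2): "?X"}))
relation_relaxed = show(relax(path, {(1, 2): "?X", (1, 0): "?Y"}))
```

`journal_relaxed` and `relation_relaxed` reproduce the two relaxed forms shown above; relaxing more slots yields a more general path type that more individual paths match.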

Feature Value Computation
A major advantage of using path types is that we do not limit ourselves to binary features. We can apply statistical methods to determine the dependence between a path and a node and use it as the corresponding feature value.

Performing Random Experiments on an MRN
Three random experiments:
– Random Experiment 1 (RE1): randomly pick a path from the MRN. Each path is selected with probability 1/|Path|, where |Path| is the total number of paths in the MRN.
– Random Experiment 2 (RE2): first randomly pick a node in the MRN, then randomly pick a path that starts from the selected node.
– Random Experiment 3 (RE3): first randomly pick a path type in the MRN, then randomly pick a path among those of that path type.
Based on the single path output by a random experiment, we can introduce two random variables, S and PT:
– S: the starting node of the selected path (the number of possible realizations of S equals the total number of nodes in the MRN).
– PT: the path type of the selected path (the number of possible realizations of PT equals the total number of path types).
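To see how the three experiments weight paths differently, the sketch below computes the exact probability of selecting each path. It rests on assumptions not stated in the slides: every random pick is taken to be uniform, only nodes that start at least one path are eligible in RE2, each path is reduced to a (starting node, path type) pair, and the toy data is invented.

```python
from collections import Counter
from fractions import Fraction

# Each path reduced to (starting node S, path type PT); hypothetical toy data.
PATHS = [("A1", "writes"), ("A1", "writes"), ("A1", "belongs"), ("A2", "writes")]

def re1(paths):
    """RE1: pick one path uniformly, so every path has probability 1/|Path|."""
    return [Fraction(1, len(paths)) for _ in paths]

def re2(paths):
    """RE2: pick a starting node uniformly, then one of its paths uniformly."""
    per_node = Counter(s for s, _ in paths)
    return [Fraction(1, len(per_node)) * Fraction(1, per_node[s]) for s, _ in paths]

def re3(paths):
    """RE3: pick a path type uniformly, then a path of that type uniformly."""
    per_type = Counter(pt for _, pt in paths)
    return [Fraction(1, len(per_type)) * Fraction(1, per_type[pt]) for _, pt in paths]
```

With this data RE1 gives every path the same weight, RE2 favours the single path of the sparsely connected node A2, and RE3 favours the single path of the rare type `belongs`.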

Measuring Node/Path Dependence via Contribution
Take the path type writes(?X,?Y) as an example:
– An instance A1 might occur in many paths of this type, say writes(A1,P1) through writes(A1,P99).
– Another instance A2 might occur in only a few, say one.
– We can then say that A1 contributes 99%, A2 1%, and the rest 0% to this “writing a paper” path type.
– In this case the dependency measure between the node A1 and the path type writes is 0.99.
We have essentially computed a conditional probability of the random variable S given PT, based on RE1:
– $P_{RE1}(S = A1 \mid PT = \text{writes})$
Define the contribution of an instance $s$ to a path type $p_t$ as the conditional probability that a randomly selected path (based on Random Experiment $k$) starts from $s$, given that it has path type $p_t$:
– $\mathrm{contribution}_k(s, p_t) = P_k(S = s \mid PT = p_t)$
– The contribution value therefore encodes the dependency information between a node and a path type.
Note that we consider a node x only as the starting point of a path and not at other positions.
– If x is in the middle of a path, it is counted as the starting point of two shorter paths. For instance, in a - r1 - x - r2 - b, the node x starts both x - r2 - b and the inverse path x - r1^-1 - a.
– Similarly, if x is at the end of a path, it is the starting node of the inverse path as well.
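The contribution under RE1 can be sketched as a simple counting exercise: since every path is equally likely, the conditional probability is the share of paths of a given type that start at a given node. The function name `contribution_1` and the reduction of each path to a (starting node, path type) pair are assumptions of this sketch.

```python
from collections import Counter
from fractions import Fraction

def contribution_1(paths):
    """Return {(s, pt): P_RE1(S = s | PT = pt)} for paths given as (start, type) pairs.

    Under RE1 every path is equally likely, so the conditional probability
    is the fraction of paths of type pt that start at s.
    """
    per_type = Counter(pt for _, pt in paths)
    per_pair = Counter(paths)
    return {(s, pt): Fraction(n, per_type[pt]) for (s, pt), n in per_pair.items()}

# The example from the slides: A1 starts 99 'writes' paths and A2 starts one.
paths = [("A1", "writes")] * 99 + [("A2", "writes")]
contrib = contribution_1(paths)
```

`contrib[("A1", "writes")]` is 99/100 and `contrib[("A2", "writes")]` is 1/100, matching the 99% and 1% in the example; nodes that start no path of the type simply have no entry, which corresponds to a contribution of 0.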

Semantic Profile
An excerpt of the semantic profile (based on contribution_1) for the director Ang Lee, taken from a movie dataset. Relation-only path types in the excerpt:
– [direct, has_actor, child_of]
– [direct, has_authors]
– [direct, has_actor, live_with]
– [direct, remake]
– [write, has_cinematographer]
– [direct, has_actor, has_lover]