Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu.

2
• Determine outliers in information networks
• Compare various algorithms that do the same

3
• E.g., the Internet, social networking sites
• Nodes – characterized by feature values
• Links – representative of relations between nodes

4
• Outliers – anomalies, novelties
• Different kinds of outliers
  ◦ Global
  ◦ Contextual

6
• Unified model considering both nodes and links
• Community discovery and outlier detection are related processes

7
• Treat each object as a multivariate data point
• Use K components to describe normal community behavior and one component to denote outliers
• Introduce a hidden variable z_i at each object indicating its community
• Treat the network information as a graph
• Model the graph as a hidden Markov random field on z_i
• Find a local minimum of the model's posterior energy, the potential energy corresponding to its posterior probability

8
[Figure: example network showing community labels Z, node features X, link structure W, and an outlier. Example communities: high-income (mean 116k, std 35k) and low-income (mean 20k, std 12k). Also shown: the model parameters and K, the number of communities.]

9
Symbol                      Definition
I = {1, ..., i, ..., M}     Indices of the objects
V = {v_1, ..., v_M}         Set of objects
S = {s_1, ..., s_M}         Given attributes of the objects
W_{M×M} = {w_ij}            Adjacency matrix containing the weights of the links
Z = {z_1, ..., z_M}         Random variables for the hidden labels of the objects
X = {x_1, ..., x_M}         Random variables for the observed data
N_i (i ∈ I)                 Neighborhood of object v_i
1, ..., k, ..., K           Indices of the normal communities
Θ = {Θ_1, ..., Θ_K}         Random variables for the model parameters

10
◦ The random variables X are conditionally independent given their labels: P(X = S | Z) = Π_i P(x_i = s_i | z_i)
◦ The k-th normal community is characterized by a set of parameters Θ_k: P(x_i = s_i | z_i = k) = P(x_i = s_i | Θ_k)
◦ Outliers are characterized by a uniform distribution: P(x_i = s_i | z_i = 0) = ρ_0
◦ A Markov random field is defined over the hidden variables Z: P(z_i | z_{I−{i}}) = P(z_i | z_{N_i})
◦ The equivalent Gibbs distribution is P(Z) = (1/H_1) exp(−U(Z)), where H_1 is a normalizing constant and U(Z) is the sum of clique potentials
◦ The goal is to find the configuration of Z that maximizes P(X = S | Z) P(Z) for a given Θ
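Maximizing P(X = S | Z) P(Z) is the same as minimizing its negative logarithm, which can be read as an energy. The following is a minimal sketch of that energy for one-dimensional Gaussian communities. The form of the clique potential is an assumption made for illustration (each linked pair of non-outlier nodes with equal labels lowers the energy by `lam * w[i][j]`); the slides only state that U(Z) is a sum of clique potentials. The function names, `lam`, and the toy data are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def posterior_energy(s, z, w, theta, rho0, lam):
    """Negative log of P(X=S|Z) P(Z), up to the constant log H_1.

    s: observed values, z: labels (0 = outlier), w: adjacency matrix,
    theta: {community: (mean, std)}, rho0: uniform outlier density,
    lam: weight of the ASSUMED pairwise clique potential.
    """
    data_term = 0.0
    for s_i, z_i in zip(s, z):
        if z_i == 0:
            data_term -= math.log(rho0)           # outlier: uniform density
        else:
            mean, std = theta[z_i]
            data_term -= gaussian_logpdf(s_i, mean, std)
    link_term = 0.0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            # Assumed potential: agreeing, linked, non-outlier neighbors lower the energy.
            if w[i][j] > 0 and z[i] != 0 and z[i] == z[j]:
                link_term -= lam * w[i][j]
    return data_term + link_term

# Toy data: two low values linked to each other, two high values linked to each other.
s = [20.0, 22.0, 115.0, 118.0]
w = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
theta = {1: (20.0, 12.0), 2: (116.0, 35.0)}
good = posterior_energy(s, [1, 1, 2, 2], w, theta, rho0=1e-3, lam=1.0)
bad = posterior_energy(s, [2, 2, 1, 1], w, theta, rho0=1e-3, lam=1.0)
print(good < bad)
```

A labeling that matches both the features and the links gets a lower energy than one that swaps the communities, which is the behavior the objective is meant to reward.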

11
• Continuous data
  ◦ Modeled as a Gaussian distribution
  ◦ Model parameters: mean, standard deviation
• Text data
  ◦ Modeled as a multinomial distribution
  ◦ Model parameters: probability of a word appearing in a community
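As a rough illustration of the two data models, the sketch below scores a value under a Gaussian community and a bag of words under a multinomial community. The function names and the example parameters are hypothetical. The multinomial coefficient is dropped because it does not depend on the community, and all word probabilities are assumed to be positive (for instance after smoothing).

```python
import math

def gaussian_log_likelihood(x, mean, std):
    """Log-density of x under a Gaussian community with the given mean and std."""
    return -math.log(std * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * std ** 2)

def multinomial_log_likelihood(word_counts, word_probs):
    """Log-probability of a bag of words under a community's word distribution.

    The multinomial coefficient is omitted: it is the same for every community.
    Assumes word_probs[w] > 0 for every word in word_counts.
    """
    return sum(count * math.log(word_probs[word]) for word, count in word_counts.items())

# Hypothetical community parameters, for illustration only.
high_income = (116.0, 35.0)
low_income = (20.0, 12.0)
mining = {"mining": 0.6, "vision": 0.1, "query": 0.3}
vision = {"mining": 0.1, "vision": 0.8, "query": 0.1}

doc = {"vision": 3, "mining": 1}
print(gaussian_log_likelihood(110.0, *high_income) > gaussian_log_likelihood(110.0, *low_income))
print(multinomial_log_likelihood(doc, vision) > multinomial_log_likelihood(doc, mining))
```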

12
[Diagram: the algorithm initializes Z, then alternates between two steps]
• Parameter estimation: given Z, find Θ that maximizes P(X|Z)
• Inference: given Θ, find Z that maximizes P(Z|X)
Θ: model parameters; Z: community labels
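The control flow of this alternation can be sketched as a small loop that takes the two steps as functions. The names `alternate`, `estimate_means`, and `nearest_mean` are hypothetical, and the toy steps plugged in below are deliberately simplified stand-ins (nearest-mean clustering on one-dimensional data, with no links and no outlier label), not the actual estimation and inference steps of the model.

```python
def alternate(z_init, estimate_params, infer_labels, max_iters=100):
    """Alternate parameter estimation and inference until the labels stop changing."""
    z = list(z_init)
    theta = estimate_params(z)            # given Z, estimate the parameters
    for _ in range(max_iters):
        new_z = infer_labels(theta)       # given the parameters, infer Z
        if new_z == z:                    # labels are stable: stop
            break
        z = new_z
        theta = estimate_params(z)
    return z, theta

# Toy stand-in steps on 1-D data (assumption: no links, no outliers).
data = [1.0, 2.0, 9.0, 10.0]

def estimate_means(z):
    groups = {}
    for x, k in zip(data, z):
        groups.setdefault(k, []).append(x)
    return {k: sum(v) / len(v) for k, v in groups.items()}

def nearest_mean(theta):
    return [min(theta, key=lambda k: abs(x - theta[k])) for x in data]

labels, means = alternate([1, 2, 1, 2], estimate_means, nearest_mean)
print(labels, means)
```

Passing the two steps in as functions keeps the loop itself independent of the data model, so the same skeleton would apply to continuous or text features.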

13
• Calculate model parameters
  ◦ Maximum likelihood estimation
• Continuous
  ◦ Mean: sample mean of the community
  ◦ Standard deviation: square root of the sample variance of the community
• Text
  ◦ Probability of a word appearing in the community: empirical probability
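These estimates are straightforward to compute. The sketch below is one possible reading, with hypothetical function names: the variance divides by n (the maximum likelihood form, which is an assumption about what "sample variance" means here), and objects labeled 0 (outliers) are left out of every community.

```python
import math
from collections import Counter

def estimate_gaussian(values):
    """Maximum likelihood mean and standard deviation of one community."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n   # ML estimate divides by n
    return mean, math.sqrt(variance)

def estimate_word_probs(documents):
    """Empirical probability of each word across a community's documents."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def estimate_parameters(s, z):
    """Per-community Gaussian parameters; label 0 (outliers) is skipped."""
    groups = {}
    for s_i, z_i in zip(s, z):
        if z_i != 0:
            groups.setdefault(z_i, []).append(s_i)
    return {k: estimate_gaussian(v) for k, v in groups.items()}

print(estimate_gaussian([2, 4, 4, 4, 5, 5, 7, 9]))
print(estimate_word_probs([["data", "mining"], ["data", "graph"]]))
print(estimate_parameters([2.0, 4.0, 100.0, 6.0], [1, 1, 0, 1]))
```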

14
• Calculate z_i values
  ◦ Given the model parameters, iteratively update the community labels of the nodes
  ◦ At each step, select the label that maximizes P(Z | X, Z_N)
• Calculate P(Z | X, Z_N) values
  ◦ If Z indicates a normal community, the value depends on both the node features and the community labels of the neighbors
  ◦ If the probability of a node belonging to any community is low enough, label the node as an outlier
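One way to picture this update is a single sweep over the nodes that compares an energy for each normal community against a fixed cost for the outlier label. The sketch below makes several assumptions that are not stated in the slides: the threshold a_0 is treated as the energy of the outlier label, neighbor agreement is rewarded by `lam` times the link weight, features are one-dimensional Gaussians, and labels are updated in place during the sweep. The function names and the toy parameters are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def update_labels(s, z, w, theta, a0, lam):
    """One sweep of label updates; label 0 marks an outlier."""
    z = list(z)
    n = len(s)
    for i in range(n):
        best_label, best_energy = 0, a0          # assumed cost of the outlier label
        for k, (mean, std) in theta.items():
            # Total link weight to neighbors currently labeled k.
            support = sum(w[i][j] for j in range(n) if j != i and z[j] == k)
            energy = -gaussian_logpdf(s[i], mean, std) - lam * support
            if energy < best_energy:
                best_label, best_energy = k, energy
        z[i] = best_label                        # later nodes see the updated label
    return z

# Toy network: nodes 0-2 have low values, nodes 3-4 high values,
# node 5 has a mid-range value but links only to low-value nodes.
s = [20.0, 22.0, 19.0, 115.0, 118.0, 70.0]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (5, 0), (5, 1)]
w = [[0] * len(s) for _ in s]
for i, j in edges:
    w[i][j] = w[j][i] = 1
theta = {1: (20.0, 5.0), 2: (116.0, 10.0)}       # hypothetical community parameters
new_z = update_labels(s, [1, 1, 1, 2, 2, 1], w, theta, a0=8.0, lam=1.0)
print(new_z)
```

In this toy case node 5 fits neither community well enough to beat the outlier cost, while the other nodes keep their community labels.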

15
• Setting hyperparameters
  ◦ a_0 = threshold
  ◦ Λ = confidence in the network
  ◦ K = number of communities
• Initialization
  ◦ The initialization may group outliers into clusters
  ◦ Later iterations eventually correct this

16
• Data generation
  ◦ Generate continuous data based on Gaussian distributions and generate labels according to the model
  ◦ Define r: percentage of outliers, K: number of communities
• Baseline models
  ◦ GLODA: global outlier detection (based on node features only)
  ◦ DNODA: local outlier detection (checks the feature values of direct neighbors)
  ◦ CNA: partition the data into communities based on links, then conduct outlier detection in each community
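To make the difference between a global and a neighbor-based baseline concrete, the sketch below scores nodes in two simplified ways. These are stand-ins written for illustration, not the GLODA and DNODA implementations: the global score is the distance from the overall mean, the neighbor score is the distance from the mean of the direct neighbors, and the top fraction r of nodes by score is flagged. All names and the toy network are hypothetical.

```python
def global_scores(values):
    """Distance of each value from the overall mean (features only)."""
    mean = sum(values) / len(values)
    return [abs(v - mean) for v in values]

def neighbor_scores(values, neighbors):
    """Distance of each value from the mean of its direct neighbors."""
    scores = []
    for i, v in enumerate(values):
        nbr_vals = [values[j] for j in neighbors[i]]
        scores.append(abs(v - sum(nbr_vals) / len(nbr_vals)) if nbr_vals else 0.0)
    return scores

def top_outliers(scores, r):
    """Indices of the top fraction r of nodes by score (at least one node)."""
    count = max(1, round(r * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:count])

# Toy network: nodes 0-3 form a low-value community, nodes 4-7 a high-value one,
# and node 8 has a mid-range value but links only to low-value nodes.
values = [18, 20, 22, 19, 110, 120, 115, 118, 60]
neighbors = {
    0: [1, 2, 3, 8], 1: [0, 2, 3, 8], 2: [0, 1, 3, 8], 3: [0, 1, 2],
    4: [5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
    8: [0, 1, 2],
}
by_global = top_outliers(global_scores(values), r=0.1)
by_neighbors = top_outliers(neighbor_scores(values, neighbors), r=0.1)
print(by_global, by_neighbors)
```

Node 8 sits close to the overall mean, so the feature-only score does not rank it first, while the neighbor-based score does. This is the kind of contextual outlier that a purely global method can miss.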

18
• Communities
  ◦ Data mining, artificial intelligence, database, information analysis
• Sub-network of conferences
  ◦ Links: percentage of common authors between two conferences
  ◦ Node features: publication titles in the conference
• Sub-network of authors
  ◦ Links: co-authorship relationship
  ◦ Node features: titles of publications by an author

19
Community outliers: CVPR, CIKM

20
• Community Outliers
• Community Outlier Detection
Questions

21
• On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han
• Outlier detection – Irad Ben-Gal
• Automated detection of outliers in real-world data – Last, Kandel
• Outlier Detection for High Dimensional Data – Aggarwal, Yu
