Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes ∗ Source: VLDB.


1 Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes ∗ Source: VLDB

2 1. Background
Heterogeneous information networks contain multiple types of objects and links, with different kinds of attributes. Clustering such networks makes them easier to retrieve and explore.
Challenges:
- Attribute values of objects are often incomplete (an object may contain only partial, or even no, observations for a given attribute set).
- Links of different types may carry different semantic meanings (each may have its own level of importance in the clustering process).
Classic clustering algorithms (k-means, etc.) CANNOT handle this setting.

3 2.1 The Data Structure
A heterogeneous information network G = (V, E, W) is a directed graph:
- each node v ∈ V corresponds to an object;
- each link e ∈ E corresponds to a relationship between the linked objects, with weight w(e).
Different from a traditional (homogeneous) network, it has two mapping functions:
- τ : V → A, mapping each object to its object type;
- φ : E → R, mapping each link to its link type;
where A is the object type set and R is the link type set. Since the graph is directed, a link type R from A to B is distinct from its inverse R⁻¹ from B to A (A R B ≠ B R⁻¹ A).

4 2.1 The Data Structure
Attributes are associated with objects:
- X = {X1, ..., XT} is the set of attributes across all types of objects.
- Each object v ∈ V contains a subset of the attributes.
- The observation set v[X] = {x_{v,1}, x_{v,2}, ..., x_{v,N_{X,v}}}, where N_{X,v} is the total number of observations of attribute X attached to object v.
- Attributes can be shared by different types of objects, or be unique to one type.
- V_X denotes the set of objects that contain attribute X.
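The two slides above can be made concrete with a small, purely illustrative sketch of the data structure. The class and method names here are assumptions for exposition, not the paper's code:

```python
# Minimal sketch of the heterogeneous network G = (V, E, W) with the
# type mappings tau (object -> object type) and phi (link -> link type),
# plus per-object attribute observations v[X].
class HeteroNetwork:
    def __init__(self):
        self.node_type = {}   # tau: object id -> object type
        self.links = []       # list of (src, dst, link_type, weight)
        self.attrs = {}       # (object id, attribute name) -> observation list

    def add_node(self, v, obj_type):
        self.node_type[v] = obj_type

    def add_link(self, src, dst, link_type, weight=1.0):
        # Directed: a link of type r from A to B is distinct from
        # its inverse r^-1 from B to A.
        self.links.append((src, dst, link_type, weight))

    def add_observation(self, v, attr, value):
        self.attrs.setdefault((v, attr), []).append(value)

    def objects_with(self, attr):
        # V_X: the set of objects carrying at least one observation of X
        return {v for (v, a) in self.attrs if a == attr}

# Toy bibliographic network: papers carry text; authors and venues do not.
g = HeteroNetwork()
g.add_node("p1", "paper"); g.add_node("a1", "author"); g.add_node("v1", "venue")
g.add_link("a1", "p1", "write"); g.add_link("p1", "v1", "published_by")
g.add_observation("p1", "text", "clustering")
g.add_observation("p1", "text", "network")
```

Note how incompleteness is natural here: `objects_with("text")` returns only the papers, so authors and venues have an empty observation set for that attribute.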

5 2.2 The Clustering Problem
The goal is to map every object in the network into a unified hidden space (a soft clustering).
Two challenges:
- Attribute values of objects are often incomplete (an object may contain only partial, or even no, observations for a given attribute set).
- Links of different types may carry different semantic meanings (each may have its own level of importance in the clustering process).

6 2.2 The Clustering Problem
Example 1: Bibliographic information network.
- Object types: paper, author, venue.
- Link types: paper→author, author→paper, paper→venue, venue→paper, author→venue, venue→author, paper→paper, author→author, venue→venue.
- Attributes (only one): papers carry a text attribute (a set of words); authors and venues have none.

7 2.2 The Clustering Problem
For authors and venues, the only available information comes from the papers linked to them; for papers, both text attributes and links of different types are available. The author type may be more important than the venue type when clustering papers, so the strength of each link type should be learned.

8 2.2 The Clustering Problem
Example 2: Weather sensor network.
- Object types: precipitation sensor, temperature sensor.
- Link types: p→p, t→t, p→t, t→p.
- Attributes (multiple): precipitation, temperature.
A sensor may sometimes register no observation or multiple observations, so a sensor object may contain only part of the attribute set.

9 2.2 The Clustering Problem
Given a network G = (V, E, W), a cluster number K, a set of attributes X ∈ X of interest, and the attribute observations {v[X]} for all objects, the goal is:
1. to learn a soft clustering, i.e., a membership probability matrix Θ_{|V|×K} = (θ_v)_{v∈V}, where Θ(v, k) denotes the probability of object v belonging to cluster k, with 0 ≤ Θ(v, k) ≤ 1 and Σ_{k=1}^{K} Θ(v, k) = 1, and θ_v is the K-dimensional cluster membership vector of object v;
2. to learn the strengths (importance weights) of the different link types in determining the cluster memberships of the objects: γ_{|R|×1}, where γ(r) is a real number standing for the importance weight of link type r ∈ R.
The cluster number K is assumed to be given.
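The two learned quantities can be pictured with a tiny illustrative example (values and link-type names are assumptions, matching the bibliographic toy example above):

```python
# The two outputs of the clustering problem: the soft membership matrix
# Theta (one row of length K per object, rows summing to 1) and the
# link-type strength vector gamma (one nonnegative real per link type).
K = 3
theta = {
    "p1": [0.7, 0.2, 0.1],   # Theta(v, k): probability of object v in cluster k
    "a1": [0.6, 0.3, 0.1],
    "v1": [0.3, 0.4, 0.3],
}
gamma = {"write": 1.0, "published_by": 1.0, "written_by": 1.0}

# Soft clustering: each row is a probability distribution over clusters.
row_sums = {v: sum(row) for v, row in theta.items()}
```

Initializing every γ(r) to 1 mirrors the algorithm's starting assumption (Section 4) that all link types are equally important.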

10 3.1 Model Overview
A good clustering configuration Θ should satisfy two properties:
- Given the clustering configuration, the observed attributes should be generated with high probability.
- The clustering configuration should be highly consistent with the network structure.
Given the network G, the relation strength vector γ, and the cluster component parameters β, the likelihood of the observations of all the attributes X ∈ X is
p(Θ, {v[X]}_{X∈X} | G, γ, β) = p(Θ | G, γ) · Π_{X∈X} p({v[X]}_{v∈V_X} | Θ, β).

11 3.1 Model Overview
Given G and γ, p(Θ | G, γ) can be computed; given β and Θ, Π_{X∈X} p({v[X]}_{v∈V_X} | Θ, β) can be computed.
Goal: find the best parameters γ and β and the best clustering configuration Θ that maximize the likelihood.

12 3.2.1 Single Attribute
Let X be the only attribute of interest in the network. The attribute observations v[X] of each object v are generated from a mixture model: each component is a probabilistic model standing for one cluster, with parameters β to be learned, and the component weights are given by θ_v. Given the clustering configuration Θ, the probability of all the observations is
p({v[X]}_{v∈V_X} | Θ, β) = Π_{v∈V_X} Π_{x∈v[X]} Σ_{k=1}^{K} Θ(v, k) p(x | β_k).

13 3.2.1 Single Attribute
Text attribute with categorical distribution: objects contain text attributes in the form of a term list over a vocabulary l = 1, ..., m. Each cluster k has its own term distribution, with parameter β_k = (β_{k,1}, ..., β_{k,m}), where β_{k,l} is the probability of term l appearing in cluster k. The probability of observing all the current attribute values is
p({v[X]}_{v∈V_X} | Θ, β) = Π_{v∈V_X} Π_{l=1}^{m} (Σ_{k=1}^{K} Θ(v, k) β_{k,l})^{c_{v,l}},
where c_{v,l} denotes the count of term l in object v.
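The categorical-mixture likelihood above can be sketched directly in code. The parameter values below are made up for illustration; only the formula follows the slide:

```python
import math

# Log-likelihood of one object's term counts under the categorical mixture:
# each term occurrence picks a cluster k with probability theta_v[k], then
# a term l with probability beta[k][l].
def log_likelihood_text(counts, theta_v, beta):
    # counts: {term index l: c_{v,l}}; beta[k][l] = P(term l | cluster k)
    ll = 0.0
    for l, c in counts.items():
        p_l = sum(t * b[l] for t, b in zip(theta_v, beta))  # mixture term prob
        ll += c * math.log(p_l)
    return ll

beta = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]   # 2 clusters, 3-term vocabulary
theta_v = [0.9, 0.1]
ll = log_likelihood_text({0: 3, 1: 1}, theta_v, beta)
```

An object dominated by terms that cluster 1 favors gets a higher likelihood under a θ_v concentrated on cluster 1, which is exactly the signal the E-step of Section 4.1 exploits.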

14 Numerical attribute with Gaussian distribution
Objects contain numerical observations in the form of a value list over the domain ℝ. The k-th cluster is a Gaussian distribution with parameters β_k = (μ_k, σ_k²). The probability density of all the observations over all objects is
p({v[X]}_{v∈V_X} | Θ, β) = Π_{v∈V_X} Π_{x∈v[X]} Σ_{k=1}^{K} Θ(v, k) N(x | μ_k, σ_k²).
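The Gaussian case is the same mixture with a density in place of the categorical probability. A minimal sketch, with illustrative parameter values:

```python
import math

# Mixture density of one numerical observation x for an object v:
# component k is N(mu_k, sigma_k^2), weighted by theta_v[k].
def mixture_pdf(x, theta_v, params):
    total = 0.0
    for t, (mu, var) in zip(theta_v, params):
        total += t * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total

params = [(0.0, 1.0), (5.0, 2.0)]        # (mu_k, sigma_k^2) per cluster
density = mixture_pdf(0.0, [0.5, 0.5], params)
```

A temperature sensor whose readings sit near μ_k gets most of its probability mass from component k, which again drives its soft membership.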

15 3.2.2 Multiple Attributes
When multiple attributes X1, ..., XT are specified by the user, and assuming independence among them, the probability density of the observed attribute values {v[X1]}, ..., {v[XT]} for a given clustering configuration Θ is
p({v[X1]}, ..., {v[XT]} | Θ, β) = Π_{t=1}^{T} p({v[Xt]}_{v∈V_{Xt}} | Θ, β_t).

16 3.3 Modeling Structural Consistency
From the view of links, the more similar two objects are in terms of cluster membership, the more likely they are to be connected by a link. For a link e = ⟨vi, vj⟩ ∈ E of type r = φ(e) ∈ R, the importance of the link type to the clustering process is denoted by a real number γ(r). (The weight w(e) is specified in the network as input; γ(r) is defined on link types and needs to be learned.) The consistency of the two cluster membership vectors θ_i and θ_j over link e, under the strength weights γ for each link type, is measured by a feature function f(θ_i, θ_j, e, γ).

17 3.3 Modeling Structural Consistency
Several desiderata for a good feature function f:
- The value of f should increase with greater similarity of θ_i and θ_j.
- The value of f should decrease with greater importance of the link e, either in terms of its specified weight w(e) or its learned link type strength γ(r).
- f should not be symmetric between its first two arguments θ_i and θ_j (the graph is directed).

18 3.3 Modeling Structural Consistency
The feature function is built on the cross entropy H(θ_j, θ_i) = −Σ_{k=1}^{K} θ_j(k) log θ_i(k), which evaluates the deviation of θ_j from θ_i:
f(θ_i, θ_j, e, γ) = −γ(r) · w(e) · H(θ_j, θ_i).
- For a fixed value of γ(r), H(θ_j, θ_i) is minimal (and f is maximal) when θ_i = θ_j.
- The value of f decreases with increasing learned link type strength γ(r) or input link weight w(e).
- Since γ(r) ≥ 0 and H ≥ 0, f ≤ 0.
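The feature function and its three desiderata can be checked with a few lines of code. The membership vectors below are illustrative stand-ins (the slide's actual Θ values come from a figure not reproduced here):

```python
import math

# f(theta_i, theta_j, e, gamma) = -gamma(r) * w(e) * H(theta_j, theta_i),
# with H the cross entropy from theta_j to theta_i.
def cross_entropy(q, p):
    # H(q, p) = -sum_k q[k] * log p[k]
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

def feature(theta_i, theta_j, weight, gamma_r):
    return -gamma_r * weight * cross_entropy(theta_j, theta_i)

theta_1 = [0.8, 0.1, 0.1]   # leans toward cluster 1
theta_3 = [0.7, 0.2, 0.1]   # similar to theta_1
theta_5 = [0.1, 0.1, 0.8]   # leans toward cluster 3
# Similar memberships give a higher (less negative) f than dissimilar ones,
# and f falls as gamma(r) or w(e) grows.
f_similar = feature(theta_1, theta_3, 1.0, 1.0)
f_dissimilar = feature(theta_1, theta_5, 1.0, 1.0)
```

Because cross entropy is not symmetric in its arguments, f(i, j) ≠ f(j, i) in general, satisfying the third desideratum for directed graphs.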

19 3.3 Modeling Structural Consistency
Example: the weights of all links are w(e) = 1; the cluster number is 3; Θ is given in the figure. Link types and their strengths: write(author, paper) with γ1, published_by(paper, venue) with γ2, written_by(paper, author) with γ3.

20 3.3 Modeling Structural Consistency
Objects 1 and 3 are more likely to belong to the first cluster, Object 4 is a neutral object, and Object 5 is more likely to belong to the third cluster.
- f(1, 3) = −0.4701γ3; f(1, 4) = −1.7174γ3; f(1, 5) = −2.3410γ3; so f(1, 3) ≥ f(1, 4) ≥ f(1, 5).
- f(1, 2) = −0.4701γ2 and f(1, 3) = −0.4701γ3. If γ2 > γ3, then f(1, 2) < f(1, 3): stronger link types are likely to exist only between objects that are very similar to each other.
- f(1, 4) = −1.7174γ3 but f(4, 1) = −1.0986γ1, so f(1, 4) ≠ f(4, 1): the function is not symmetric.

21 3.3 Modeling Structural Consistency
The structural consistency is modeled by a log-linear model:
p(Θ | G, γ) = (1/Z(γ)) exp(Σ_{e∈E} f(θ_i, θ_j, e, γ)),
where Z(γ) is the partition function that makes the distribution integrate to 1: Z(γ) = ∫_Θ exp(Σ_{e∈E} f(θ_i, θ_j, e, γ)) dΘ.

22 3.4 The Unified Model
Goal: determine the best clustering results Θ, link type strengths γ, and cluster component parameters β that maximize both the generative probability of the attribute observations and the consistency with the network structure. A Gaussian prior is added to γ as a regularization to avoid overfitting. The new objective function is the sum of the log-likelihood of the attribute observations, the log structural-consistency probability, and the log Gaussian prior on γ.

23 4. THE CLUSTERING ALGORITHM
Initial assumption: all link types play an equally important role in the clustering process (γ = 1). Then γ is updated according to the average consistency of the links of each type with the current clustering results, achieving both a good clustering and a reasonable strength vector for the link types.
The iterative algorithm contains two steps:
1. Cluster optimization step: fix the link type weights γ = γ* (determined in the last iteration); determine the best clustering results Θ and the attribute parameters β for each cluster component.
2. Link type strength learning step: fix the clustering configuration parameters Θ = Θ* and β = β* (determined in the last step); determine the best value of γ.

24 4.1 Cluster Optimization
An EM-based algorithm solves this step. In the E-step, the probability of each observation x of each object v and attribute X belonging to each cluster (usually called the hidden cluster label z_{v,x} of the observation) is derived from the current parameters Θ and β. In the M-step, Θ and β are updated according to the new memberships computed for all observations in the E-step. (The mathematical derivations are skipped here.)
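A compact, purely illustrative EM sketch for the single-text-attribute case (PLSA-style, ignoring the network-consistency prior, which the paper's actual M-step also involves):

```python
# One EM iteration: the E-step computes the hidden cluster label
# distribution for each (object, term) observation; the M-step
# re-estimates Theta and beta from the expected counts.
def em_step(counts, theta, beta):
    K, m = len(beta), len(beta[0])
    new_theta = [[0.0] * K for _ in counts]
    new_beta = [[0.0] * m for _ in range(K)]
    for v, c_v in enumerate(counts):
        for l, c in enumerate(c_v):
            if c == 0:
                continue
            # E-step: P(z = k | v, l) proportional to theta[v][k] * beta[k][l]
            r = [theta[v][k] * beta[k][l] for k in range(K)]
            s = sum(r)
            for k in range(K):
                new_theta[v][k] += c * r[k] / s
                new_beta[k][l] += c * r[k] / s
    # M-step: normalize the expected counts
    theta = [[x / sum(row) for x in row] for row in new_theta]
    beta = [[x / sum(row) for x in row] for row in new_beta]
    return theta, beta

counts = [[4, 0, 1], [0, 5, 1]]            # c_{v,l}: term counts per object
theta = [[0.6, 0.4], [0.4, 0.6]]
beta = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]
for _ in range(20):
    theta, beta = em_step(counts, theta, beta)
```

After a few iterations the memberships sharpen: the term-0-heavy object concentrates on the cluster whose β favors term 0, and likewise for the other object.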

25 4.2 Link Type Strength Learning
(The mathematical derivation is skipped here, as it is too complex for this overview.)

26 4.3 Putting It Together: The GenClus Algorithm (General Heterogeneous Network Clustering)

27 5. Effectiveness Study
Normalized Mutual Information (NMI) is used for comparison; it evaluates the similarity between two partitions of the objects. NetPLSA and iTopicModel serve as baselines. Two bibliographic networks are tested: one contains two types of objects, authors (A) and conferences (C); the other contains authors (A), conferences (C) and papers (P).
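For reference, a small implementation of the NMI measure used in the evaluation (this is the geometric-mean normalization variant; the slides do not specify which normalization the paper uses):

```python
import math
from collections import Counter

# NMI between two flat partitions, given as parallel label lists.
def nmi(labels_a, labels_b):
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information between the two label assignments
    mi = sum(c / n * math.log(n * c / (ca[a] * cb[b]))
             for (a, b), c in cab.items())
    # Entropies of each partition
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 1.0

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # identical up to label renaming
```

NMI is invariant to label renaming, which is why the evaluation can compare the max-probability cluster labels directly against the ground truth.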

28 5. Effectiveness Study
The clustering results of GenClus are also compared with two further baselines, the k-means algorithm and a spectral clustering method, by matching the cluster labels with maximum probability against the ground truth. Weather sensor network: the network is synthetically generated and contains two types of objects, temperature (T) and precipitation (P) sensors.

29 END THX

