1 DECK: Detecting Events from Web Click-through Data Ling Chen, Yiqun Hu, Wolfgang Nejdl Presented by Sebastian Föllmer.


2 Introduction. Today, the Web covers almost every event in the real world, particularly since the appearance of new kinds of Web data, such as weblogs.

3 Introduction. Aim: detect real events from the click-through data generated by Web search engines. To this end, we propose an effective algorithm which Detects Events from ClicK-through data (DECK).

4 Introduction. Two reasons why click-through data is promising for event detection: 1) Web search engines generate a huge volume of click-through data, which enables effective knowledge discovery. 2) Click-through data is well formatted.

5–9 Introduction. What is recorded in an entry of Web click-through data?
- An anonymous user identity
- The query issued by the user
- The time at which the query was submitted for search
- The URL of the clicked search result

10 Introduction. Detecting events from click-through data is not a trivial problem: 1) The information provided is limited: URLs refer to the addresses of sites, not of individual pages, so the same URL may have different semantics and correspond to different events. Similarly, the same query keywords may have been issued about different events.

11 Introduction. 2) A large amount of click-through data does not necessarily represent a real event. Table 2: the 10 most frequent entries in the click-through data logged by AOL in March 2006.

12 The Algorithm. The DECK algorithm: What is given? A collection of Web click-through data. What do we want? To detect real events from this data. DECK proceeds in four steps, as shown next.

13 The Algorithm. First, we define a query session, which consists of a query issued by a user together with the set of pages the user clicked on the search result:
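The notion of a query session can be sketched as a small record; the field names here are illustrative, not taken from the paper.

```python
# Illustrative structure of a query session as described in the slides:
# an anonymous user, a query, its submission time, and the clicked URLs.
# Field names are assumptions, not from the paper.
from dataclasses import dataclass, field

@dataclass
class QuerySession:
    user_id: str                  # anonymous user identity
    query: str                    # query issued by the user
    timestamp: float              # time the query was submitted
    clicked_urls: set = field(default_factory=set)  # clicked search results

s = QuerySession("u042", "world cup 2006", 1143849600.0,
                 {"fifa.com", "espn.com"})
print(len(s.clicked_urls))   # → 2
```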

14 The Algorithm.
Algorithm 1 DECK
Input: A set of query sessions
Output: A set of query session clusters corresponding to real events
1: Transform query sessions to polar space
2: Estimate subspaces of query sessions using KNN-GPCA
3: for each estimated subspace do
4:   Project inliers onto the subspace direction and the orthogonal direction respectively
5:   Compute the entropy of the distribution histogram in the two directions
6:   if the interestingness of the subspace is less than some threshold then
7:     Prune the subspace
8:   end if
9: end for
10: for each interesting subspace do
11:   Perform mean shift clustering
12: end for
13: Return the clustering results

15–18 The Algorithm. The same algorithm, shown again with its lines grouped into the four steps: (1) polar transformation (line 1), (2) subspace estimation (line 2), (3) subspace pruning (lines 3–9), and (4) cluster generation (lines 10–13).

19 Step 1: Polar Representation. Why choose polar space instead of the Cartesian coordinate system? If we transform each query session S to a point (θ, r) such that the angle θ reflects the semantics of S and the radius r reflects the occurring time of S, then query sessions of similar semantics are mapped to points with similar angles, which lie along a line (a subspace) passing through the origin.

20 Step 1: Polar Representation. By contrast, if we transformed a query session S to a point (x, y) in the Cartesian coordinate system such that x reflects the semantics of S and y reflects the occurring time of S, then query sessions of similar semantics would be mapped to points with similar x values, lying along a line parallel to the y axis. Lines parallel to the y axis (except the y axis itself) are not subspaces and cannot be detected by subspace estimation algorithms.

21 Step 1: Polar Representation. The more similar two query sessions S1 and S2 are in semantics, the smaller the angle |θ1 − θ2| is. Likewise, the closer S1 and S2 are in occurring time, the smaller the distance |r1 − r2| is.

22 Step 1: Polar Representation. We define the semantic similarity between two query sessions by considering their similarities not only in query keywords but also in clicked pages. The weight coefficient α ∈ [0, 1], determined experimentally, assigns different importance to the similarity of the queries and that of the clicked pages.
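As an illustration of this weighted combination, the sketch below assumes Jaccard similarity as the underlying measure for both components; the slide does not show the paper's exact per-component similarity.

```python
# Illustrative combined session similarity:
# Sim(S1, S2) = alpha * sim(queries) + (1 - alpha) * sim(clicked pages).
# Jaccard similarity is an assumed stand-in for the per-component measure.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def session_similarity(s1, s2, alpha=0.5):
    q_sim = jaccard(s1["query"].split(), s2["query"].split())
    p_sim = jaccard(s1["pages"], s2["pages"])
    return alpha * q_sim + (1 - alpha) * p_sim

s1 = {"query": "world cup 2006", "pages": ["fifa.com", "espn.com"]}
s2 = {"query": "world cup schedule", "pages": ["fifa.com"]}
print(round(session_similarity(s1, s2), 3))   # → 0.5
```

With α = 0.5 both components contribute equally; tuning α shifts the emphasis between query keywords and clicked pages, as the slide describes.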

23–24 Step 1: Polar Representation. Examples.

25 Step 1: Polar Representation. We define the n × n semantic similarity matrix M = (m_ij), with m_ij = Sim(S_i, S_j). The relative semantics of S_i is represented by the n-dimensional row vector R_i = (m_i1, m_i2, …, m_in) of M.

26 Step 1: Polar Representation. To map the semantics of S_i to an angle θ_i in polar space, we need to reduce the dimension of R_i to 1. To do so, we perform Principal Component Analysis (PCA) on M; the first principal component preserves the dominant variance in the semantic similarities. Let (f_1, f_2, …, f_n) be the first principal component. Then θ_i is computed as

θ_i = ((f_i − min_j(f_j)) / (max_j(f_j) − min_j(f_j))) · (π/2),

where min_j(f_j) and max_j(f_j) are the minimum and maximum values in the first principal component, so that θ_i is restricted to [0, π/2].

27 Step 1: Polar Representation. The radius r_i is given more simply by

r_i = (T(S_i) − min_j(T(S_j))) / (max_j(T(S_j)) − min_j(T(S_j))),

where min_j(T(S_j)) and max_j(T(S_j)) are the earliest and latest occurring times over all query sessions. r_i takes values in the range [0, 1].
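The two mappings can be sketched as follows. The slides do not show how the PCA is implemented, so the eigen-decomposition route below is one plausible realisation; it assumes the first principal component is not constant.

```python
# Sketch of the polar mapping: theta from the first principal component
# of the similarity matrix M, min-max scaled to [0, pi/2]; r from the
# session timestamps, min-max scaled to [0, 1].
import numpy as np

def polar_transform(M, times):
    """M: n x n semantic similarity matrix; times: occurring times."""
    Mc = M - M.mean(axis=0)                      # center the rows of M
    cov = np.cov(Mc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    f = Mc @ eigvecs[:, -1]                      # projection on the 1st PC
    theta = (f - f.min()) / (f.max() - f.min()) * (np.pi / 2)
    t = np.asarray(times, dtype=float)
    r = (t - t.min()) / (t.max() - t.min())
    return theta, r
```

Sessions with similar rows of M receive similar angles, and sessions close in time receive similar radii, which is exactly the subspace and cluster consistency the transformation is designed for.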

28 Step 1: Polar Representation. The polar transformation has the following two properties:
- Subspace consistency: mapping the semantics to an angle causes query sessions of similar semantics to lie on one and only one 1D subspace. For example, the points of S1 to S3 and the points of S4 to S6 lie around the two dotted lines, while the point of S7 appears as an outlier in the figure.
- Cluster consistency: mapping the occurring time to a radius forces query sessions of similar semantics and similar occurring time to appear as clusters within subspaces. For example, the points of S1 to S3 form a cluster on the lower dotted line, while the points of S4 to S6 are distributed along the upper dotted line.

29 Step 2: Subspace Estimation. Now we have to find the subspaces of similar semantics. To this end, we propose KNN-GPCA, a new algorithm that improves on GPCA. GPCA has two advantages: it needs no initialization, and it can estimate subspaces without prior knowledge of their number.

30 Step 2: Subspace Estimation. Two complications: 1) estimating the number of subspaces n, which is not easy in the presence of outliers; 2) estimating the normal vectors {b_i}, i = 1..n, of the subspaces, which is also not easy in the presence of outliers.

31 Step 2: Subspace Estimation. The performance of GPCA therefore degrades as the number of outliers grows. To filter them out, we assign weight coefficients to the data: each point receives a weight based on the distribution of its K nearest neighbours.

32 Step 2: Subspace Estimation. Weight coefficient assignment:
- x_i: a data point
- NN_K(x_i): the K nearest neighbours of x_i
- svar(NN_K(x_i)): variance of the neighbours along the subspace direction
- nvar(NN_K(x_i)): variance of the neighbours along the direction orthogonal to the subspace direction

33 Step 2: Subspace Estimation. Weight coefficient assignment: if x_i is a true data point, both svar(NN_K(x_i)) and nvar(NN_K(x_i)) are small, so we define
S(NN_K(x_i)) = svar(NN_K(x_i)) + nvar(NN_K(x_i)).
However, x_i may still not be a true data point if its neighbours are spread along the orthogonal direction of the subspace. That is why we also define the ratio
R(NN_K(x_i)) = nvar(NN_K(x_i)) / svar(NN_K(x_i)),
which must be small for a true data point.

34 Step 2: Subspace Estimation. Finally, we assign each point a weight W(x_i) taking values in [0, 1]. When W(x_i) is close to 1, x_i represents a true data point; when W(x_i) is close to 0, x_i is noise or an outlier.
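The weight assignment can be sketched as below. The exact formula mapping S and R to W(x_i) is not shown in the transcript; exp(−(S + R)) is an illustrative choice that is close to 1 for true data points (small S and R) and close to 0 for noise and outliers.

```python
# Illustrative KNN-based weight for a point in 2D polar space.
# svar/nvar are the variances of the K nearest neighbours along the
# subspace direction and its orthogonal direction. The combination
# exp(-(S + R)) is an assumption, not the paper's exact formula.
import numpy as np

def knn_weight(point, data, direction, k=5):
    direction = direction / np.linalg.norm(direction)
    ortho = np.array([-direction[1], direction[0]])   # 2D orthogonal
    dists = np.linalg.norm(data - point, axis=1)
    nbrs = data[np.argsort(dists)[1:k + 1]]           # exclude the point itself
    svar = np.var(nbrs @ direction)                   # spread along subspace
    nvar = np.var(nbrs @ ortho)                       # spread orthogonal to it
    S = svar + nvar
    R = nvar / svar if svar > 0 else np.inf
    return np.exp(-(S + R))                           # in (0, 1]
```

Points whose neighbours hug the subspace line get weights near 1; isolated points or points whose neighbours spread orthogonally get weights near 0, which is how the outliers are down-weighted before GPCA.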

35 Step 3: Subspace Pruning. After subspace estimation, each subspace contains query sessions on a similar topic. However, not every subspace is interesting to us. How do we distinguish interesting from uninteresting subspaces? In our polar space, a real event manifests as both a temporal "burst" and a semantic "burst", and hence as a particular distribution of the data points within the subspace.

36 Step 3: Subspace Pruning. Example: S2 is an interesting subspace.

37 Step 3: Subspace Pruning. We do not use a simple variance measure to define the interestingness of a subspace, because certain events, such as periodical events, have a small variance in the semantic direction but a large variance in the temporal direction. Instead, we employ an entropy measure: we project the data points in the two directions and compute the respective distribution histograms, (h_1, h_2, …, h_m) in the temporal direction and (v_1, v_2, …, v_n) in the semantic direction.

38 Step 3: Subspace Pruning. The interestingness I(E_i) ∈ [0, 1] of a subspace is computed from the entropies of the two histograms, where p ∈ [0, 1] is a weight (determined experimentally) that assigns different importance to the entropy values in the two directions; for example, if p = 1, only the temporal "burst" is considered. The smaller the entropy, the greater the interestingness.
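A sketch of such a measure, assuming interestingness is one minus the weighted, normalised entropies (the slide's exact normalisation may differ): each entropy is divided by its maximum, log of the number of bins, so a sharp burst yields a value near 1.

```python
# Illustrative entropy-based interestingness of a subspace: low entropy
# in the temporal and semantic histograms (a "burst") gives a high score.
# The 1 - weighted-normalised-entropy form is an assumption.
import math

def entropy(hist):
    total = sum(hist)
    probs = [h / total for h in hist if h > 0]
    return -sum(q * math.log(q) for q in probs)

def interestingness(temporal_hist, semantic_hist, p=0.5):
    h_t = entropy(temporal_hist) / math.log(len(temporal_hist))  # in [0, 1]
    h_s = entropy(semantic_hist) / math.log(len(semantic_hist))  # in [0, 1]
    return p * (1 - h_t) + (1 - p) * (1 - h_s)

# A bursty subspace (mass concentrated in one bin) scores higher than
# a uniform one.
print(interestingness([0, 18, 1, 1], [20, 0, 0, 0]) >
      interestingness([5, 5, 5, 5], [5, 5, 5, 5]))   # → True
```

Setting p = 1 recovers the slide's special case in which only the temporal burst matters.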

39 Step 3: Subspace Pruning. How do we select the interesting subspaces? Given a threshold ξ, subspace E_i is pruned as uninteresting if I(E_i) < ξ. We observed in our experiments that an interesting subspace has a much greater I(E) value than an uninteresting one, which makes the value of ξ easy to choose.

40 Step 4: Cluster Generation. In the remaining subspaces, events can now be detected by clustering the data points, based on the cluster consistency property explained previously. Many clustering techniques exist; we employ a non-parametric one called mean shift clustering.

41 Step 4: Cluster Generation. Mean shift: for each data point, one performs a gradient ascent on the locally estimated density until convergence. The stationary points of this procedure represent the modes of the distribution, and the data points associated (at least approximately) with the same stationary point are considered members of the same cluster.

42 Step 4: Cluster Generation. Mean shift procedure: starting at data point x_i, run the procedure to find the stationary points of the density function. In the figure, superscripts denote the mean shift iteration, the shaded and black dots denote the input data points, and the dotted circles denote the density estimation windows.

43 Step 4: Cluster Generation. First, the mean shift procedure is run from all data points to find the stationary points of the density estimate. Second, the discovered stationary points are pruned so that only local maxima are retained. The set of all points that converge to the same mode defines one cluster. The returned clusters are expected to represent real events.
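A minimal one-dimensional illustration of this procedure, using a flat (uniform) kernel; the paper's kernel and bandwidth choices are not given in the transcript, and rounding the converged modes is a crude stand-in for the mode-pruning step.

```python
# Minimal 1D mean shift sketch with a flat kernel of bandwidth h:
# each point repeatedly moves to the mean of the points inside its
# window until convergence; points converging to the same mode form
# one cluster. Kernel choice and mode grouping are illustrative.
def mean_shift_1d(points, h=1.0, tol=1e-4, max_iter=100):
    modes = []
    for x in points:
        for _ in range(max_iter):
            window = [p for p in points if abs(p - x) <= h]
            m = sum(window) / len(window)        # mean shift update
            if abs(m - x) < tol:                 # stationary point reached
                break
            x = m
        modes.append(round(x, 2))                # crude mode grouping
    clusters = {}
    for p, m in zip(points, modes):              # group by converged mode
        clusters.setdefault(m, []).append(p)
    return clusters

print(mean_shift_1d([1.0, 1.1, 1.2, 5.0, 5.1]))
```

The two groups of nearby points converge to two distinct modes and thus form two clusters, which is the behaviour DECK relies on within each interesting subspace.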


45 Performance Study of DECK. Description of the data set: the real-life Web click-through data collected by AOL from March to May 2006; 35 events are used in the experiments. We then randomly select query sessions representing either a real event or a non-real event to generate five data sets, which contain 5K, 10K, 20K, 50K, and 100K query sessions respectively.

46 Performance Study of DECK. Result analysis: we define:

47 Performance Study of DECK. Experimental results: DECK is our algorithm; 2PClustering is the first known algorithm on click-through data; DECK-GPCA employs the original GPCA to estimate subspaces (to assess the performance of subspace estimation); DECK-NP is DECK with no pruning (to assess the performance of subspace pruning).

48 Performance Study of DECK. Experimental results: we further evaluate the performance using the entropy measure. DECK obtains better results: since 2PClustering does not prune any data, its entropy is higher than that of DECK.

49 Conclusion. DECK rocks. Questions?

