1 DECK: Detecting Events from Web Click-through Data Ling Chen, Yiqun Hu, Wolfgang Nejdl Presented by Sebastian Föllmer.

Presentation transcript:


2 Introduction Today, the Web covers almost every event in the real world, particularly with the appearance of new Web data sources such as weblogs.

3 Introduction Aim: detect real events from the click-through data generated by Web search engines. To this end, we propose an effective algorithm which Detects Events from ClicK-through data (DECK).

4 Introduction Two reasons why click-through data is promising for event detection: 1) Web search engines generate a huge volume of click-through data, which enables effective knowledge discovery. 2) Click-through data is well formatted.

5 Introduction What is recorded in Web click-through data?
- An anonymous user identity
- The query issued by the user
- The time at which the query was submitted for search
- The URL of each clicked search result

10 Introduction Detecting events from click-through data is not a trivial problem: 1) The information provided is limited: URLs refer to the addresses of sites, not of pages, so the same URL may have different semantics and correspond to different events. Similarly, the same query keywords may have been issued about different events.

11 Introduction 2) A large amount of click-through data does not necessarily represent a real event. The table lists the most frequent entries in the click-through data logged by AOL in March 2006.

12 The Algorithm The DECK algorithm. Given: a collection of Web click-through data. Goal: detect real events from this data. DECK proceeds in four steps, described below.

13 The Algorithm First, we define a query session, which consists of a query issued by the user and the set of pages the user clicked on the search results.
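As a concrete sketch, a query session can be represented as a small record; the field names below are illustrative choices, not identifiers from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class QuerySession:
    user_id: str                  # anonymous user identity
    query: str                    # query issued by the user
    timestamp: float              # time the query was submitted (e.g. Unix time)
    clicked_urls: set = field(default_factory=set)  # URLs of clicked search results

s = QuerySession("u123", "world cup schedule", 1143849600.0, {"fifa.com"})
print(s.query)              # world cup schedule
print(len(s.clicked_urls))  # 1
```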

14 The Algorithm

Algorithm 1 DECK
Input: A set of query sessions
Output: A set of query session clusters corresponding to real events
1: Transform query sessions to polar space
2: Estimate subspaces of query sessions using KNN-GPCA
3: for each estimated subspace do
4:   Project inliers onto the subspace direction and the orthogonal direction respectively
5:   Compute the entropy of the distribution histogram in the two directions
6:   if the interestingness of the subspace is less than some threshold then
7:     Prune the subspace
8:   end if
9: end for
10: for each interesting subspace do
11:   Perform mean shift clustering
12: end for
13: Return the clustering results


19 Step 1 : Polar representation Why choose polar space instead of a Cartesian coordinate system? If we transform each query session S to a point (θ, r) such that the angle θ reflects the semantics of S and the radius r reflects the occurring time of S, then query sessions of similar semantics are mapped to points with similar angles, which lie along a line (a subspace) passing through the origin.

20 Step 1 : Polar representation In contrast, if we transform a query session S to a point (x, y) in a Cartesian coordinate system such that x reflects the semantics of S and y reflects the occurring time of S, then query sessions of similar semantics are mapped to points with similar x values, which lie along a line parallel to the y axis. Lines parallel to the y axis (except the y axis itself) do not pass through the origin, so they are not subspaces and cannot be detected by subspace estimation algorithms.

21 Step 1 : Polar representation The more similar two query sessions S1 and S2 are in semantics, the smaller the angle |θ1 − θ2| is. Likewise, the closer S1 and S2 are in occurring time, the smaller the distance |r1 − r2| is.

22 Step 1 : Polar Representation We define the semantic similarity between two query sessions by considering their similarities not only in query keywords but also in clicked pages. The weight coefficient α ∈ [0,1], determined experimentally, assigns different importance to the similarities of the queries and of the clicked pages.
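A hedged sketch of this combined similarity; the slide does not show the base similarity measures, so Jaccard similarity over keyword sets and clicked-page sets is used purely for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def session_similarity(q1, p1, q2, p2, alpha=0.5):
    """Sim(S1, S2) = alpha * sim(queries) + (1 - alpha) * sim(clicked pages)."""
    return alpha * jaccard(q1, q2) + (1 - alpha) * jaccard(p1, p2)

sim = session_similarity({"world", "cup"}, {"fifa.com"},
                         {"world", "cup", "2006"}, {"fifa.com"}, alpha=0.5)
print(sim)  # 0.5 * 2/3 + 0.5 * 1.0 = 0.8333...
```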


25 Step 1 : Polar representation We define the n × n semantic similarity matrix M = (m_ij) with m_ij = Sim(S_i, S_j). The relative semantics of S_i is represented by the n-dimensional row vector R_i = (m_i1, m_i2, …, m_in) of M.
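Building M from any pairwise similarity function can be sketched as follows; the session representation (keyword set, clicked-page set) and the toy data are illustrative:

```python
import numpy as np

def similarity_matrix(sessions, sim):
    """M[i, j] = sim(sessions[i], sessions[j]); row i is the vector R_i."""
    n = len(sessions)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = sim(sessions[i], sessions[j])
    return M

jac = lambda a, b: len(a & b) / len(a | b) if a | b else 0.0
sessions = [({"world", "cup"}, {"fifa.com"}),
            ({"world", "cup"}, {"fifa.com"}),
            ({"tax", "forms"}, {"irs.gov"})]
M = similarity_matrix(sessions,
                      lambda s, t: 0.5 * jac(s[0], t[0]) + 0.5 * jac(s[1], t[1]))
print(M[0].tolist())  # R_1 = [1.0, 1.0, 0.0]
```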

26 Step 1 : Polar representation In order to map the semantics of S_i to an angle θ_i in polar space, we need to reduce the dimension of R_i to 1. To do so, we perform Principal Component Analysis (PCA) on M; the first principal component preserves the dominant variance in the semantic similarities. Let {f_1, f_2, …, f_n} be the first principal component. Then θ_i is computed by min-max scaling:

θ_i = (f_i − min_j(f_j)) / (max_j(f_j) − min_j(f_j)) · π/2

where min_j(f_j) and max_j(f_j) are the minimum and maximum values in the first principal component, so θ_i is restricted to [0, π/2].

27 Step 1 : Polar representation The radius r_i is obtained more simply, by min-max scaling the occurring time:

r_i = (T(S_i) − min_j(T(S_j))) / (max_j(T(S_j)) − min_j(T(S_j)))

where min_j(T(S_j)) and max_j(T(S_j)) are respectively the earliest and latest occurring times over all query sessions, so r_i takes values in [0, 1].
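Putting the two mappings together, a minimal sketch (the PCA is done via SVD of the column-centered similarity matrix; the toy M and timestamps are made up):

```python
import numpy as np

def polar_transform(M: np.ndarray, times: np.ndarray):
    """Map each session to (theta, r): theta from the first principal
    component of M, min-max scaled to [0, pi/2]; r from the occurring
    time, min-max scaled to [0, 1]."""
    centered = M - M.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    f = centered @ vt[0]                 # first principal component scores
    theta = (f - f.min()) / (f.max() - f.min()) * (np.pi / 2)
    r = (times - times.min()) / (times.max() - times.min())
    return theta, r

M = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
times = np.array([10.0, 12.0, 50.0])
theta, r = polar_transform(M, times)
print(r.tolist())  # [0.0, 0.05, 1.0]
```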

28 Step 1 : Polar representation The polar transformation has the following two properties:
- Subspace consistency: mapping semantics to an angle causes query sessions of similar semantics to lie on one and only one 1D subspace. (E.g., the points of S1 to S3 and of S4 to S6 lie around the two dotted lines in the figure, while the point of S7 appears as an outlier.)
- Cluster consistency: mapping occurring time to a radius forces query sessions of similar semantics and similar occurring time to appear as clusters within subspaces. (E.g., the points of S1 to S3 form a cluster on the lower dotted line, while the points of S4 to S6 are spread along the upper dotted line.)

29 Step 2 : Subspace Estimation Now we have to find the subspaces of similar semantics. To this end, we propose a new algorithm, KNN-GPCA, which improves the GPCA algorithm. GPCA has two advantages: it does not need an initialization, and it can estimate subspaces without prior knowledge of their number.

30 Step 2 : Subspace Estimation GPCA faces two complications, both made harder by the presence of outliers: 1) estimating the number of subspaces, n; 2) estimating the normal vectors {b_i}, i = 1..n, of the subspaces.

31 Step 2 : Subspace Estimation So the performance of GPCA degrades as the number of outliers grows. To filter them out, we assign a weight coefficient to each data point, based on the distribution of its K nearest neighbours.

32 Step 2 : Subspace Estimation Weight coefficient assignment, notation:
- x_i : one data point
- NN_k(x_i) : the K nearest neighbours of x_i
- svar(NN_k(x_i)) : variance of the neighbours along the subspace direction
- nvar(NN_k(x_i)) : variance of the neighbours along the direction orthogonal to the subspace direction

33 Step 2 : Subspace Estimation If x_i is a true data point, both svar(NN_k(x_i)) and nvar(NN_k(x_i)) are small, so we define their sum:

S(NN_k(x_i)) = svar(NN_k(x_i)) + nvar(NN_k(x_i))

However, x_i may still not be a true data point if its neighbours spread along the orthogonal direction of the subspace. That is why we also define the ratio

R(NN_k(x_i)) = nvar(NN_k(x_i)) / svar(NN_k(x_i))

which must also be small for a true data point.

34 Step 2 : Subspace Estimation Finally, we assign each point a weight W(x_i), taking values in [0,1], that decreases with S(NN_k(x_i)) and R(NN_k(x_i)). When W(x_i) is close to 1, x_i represents a true data point; when W(x_i) is close to 0, x_i is noise or an outlier.
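A minimal sketch of this weighting for 2-D points. Two things here are assumptions, not from the slides: the local subspace direction is estimated with a per-neighbourhood PCA (the slide does not specify how svar/nvar are measured), and W(x_i) = exp(−(S + R)) is an assumed combination chosen only so that small S and R give weights near 1.

```python
import numpy as np

def knn_weights(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Weight each point by how tightly its neighbourhood fits a 1-D subspace."""
    n = len(X)
    w = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nbhd = X[np.argsort(d)[:k + 1]]        # x_i and its K nearest neighbours
        centered = nbhd - nbhd.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        svar = np.var(centered @ vt[0])        # variance along the local subspace direction
        nvar = np.var(centered @ vt[1])        # variance along the orthogonal direction
        s = svar + nvar
        r = nvar / svar if svar > 1e-12 else np.inf
        w[i] = np.exp(-(s + r))                # assumed mapping of (S, R) into [0, 1]
    return w

rng = np.random.default_rng(0)
line = np.column_stack([np.linspace(0, 1, 30), np.zeros(30)]) + rng.normal(0, 0.01, (30, 2))
outlier = np.array([[0.5, 2.0]])
w = knn_weights(np.vstack([line, outlier]))
print(w[:30].mean() > w[30])  # True: inliers on the line outweigh the outlier
```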

35 Step 3 : Subspace Pruning After subspace estimation, each subspace contains query sessions of a similar topic. However, not every subspace is interesting to us. How do we distinguish interesting from uninteresting subspaces? In our polar space, an event is represented by both a temporal "burst" AND a semantic "burst", i.e. a particular distribution of the data points in the subspace.

36 Step 3 : Subspace Pruning In the figure, S_2 is an interesting subspace.

37 Step 3 : Subspace Pruning We do not use a simple variance measure to define the interestingness of a subspace, because of certain events, such as periodical events, which have a small variance in the semantic direction but a large variance in the temporal direction. Instead, we employ an entropy measure: we project the data points onto the two directions and compute the respective distribution histograms, (h_1, h_2, …, h_m) for the temporal direction and (v_1, v_2, …, v_n) for the semantic direction.

38 Step 3 : Subspace Pruning The interestingness I(s_i) ∈ [0,1] of a subspace is computed from the entropies of the temporal and semantic histograms, where p ∈ [0,1] is a weight (determined experimentally) that assigns different importance to the entropy values in the two directions; for example, if p = 1, only the temporal "burst" is considered. The smaller the value of the entropy, the greater the interestingness.

39 Step 3 : Subspace Pruning How do we select the interesting subspaces? Given a threshold ξ, subspace E_i is pruned as uninteresting if I(E_i) < ξ. We observed in our experiments that an interesting subspace has a much greater I(E) value than an uninteresting one, which makes the value of ξ easy to decide.
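A hedged sketch of entropy-based interestingness. The slides give only the ingredients (two histograms, a weight p, smaller entropy means more interesting, I in [0,1]); combining the normalized entropies as 1 − (p·H_t + (1−p)·H_s) is an assumption chosen to satisfy those stated properties, not the paper's exact formula.

```python
import numpy as np

def norm_entropy(hist: np.ndarray) -> float:
    """Shannon entropy of a histogram, normalized into [0, 1]."""
    p = hist / hist.sum()
    p = p[p > 0]
    if len(p) <= 1:
        return 0.0
    return float(-(p * np.log(p)).sum() / np.log(len(hist)))

def interestingness(h_temporal: np.ndarray, h_semantic: np.ndarray, p: float = 0.5) -> float:
    """Assumed combination: low entropy in either direction raises interestingness."""
    return 1.0 - (p * norm_entropy(h_temporal) + (1 - p) * norm_entropy(h_semantic))

bursty = np.array([0, 1, 90, 2, 0, 1])        # sharp burst in one bin
uniform = np.array([15, 16, 15, 14, 16, 15])  # no burst
print(interestingness(bursty, bursty) > interestingness(uniform, uniform))  # True
```

A pruning threshold ξ would then simply discard subspaces whose interestingness falls below it.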

40 Step 4 : Cluster Generation In the remaining subspaces, events can now be detected by clustering the data points, based on the cluster-consistency property explained previously. Many clustering techniques exist; we employ a non-parametric one called mean shift clustering.

41 Step 4 : Cluster Generation Mean Shift: For each data point, one performs a gradient ascent procedure on the local estimated density until convergence. The stationary points of this procedure represent the modes of the distribution. Furthermore, the data points associated (at least approximately) with the same stationary point are considered members of the same cluster.

42 Step 4 : Cluster Generation Mean shift procedure: starting at a data point x_i, run the mean shift procedure to find the stationary points of the density function. (In the accompanying figure, superscripts denote the mean shift iteration, the shaded and black dots denote the input data points, and the dotted circles denote the density estimation windows.)

43 Step 4 : Cluster Generation First, the mean shift procedure is run from every data point to find the stationary points of the density estimate. Second, the discovered stationary points are pruned so that only local maxima are retained. The set of all points that converge to the same mode defines one cluster. The returned clusters are expected to represent real events!
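A minimal, self-contained mean shift sketch with a flat (uniform) kernel: each point is iteratively moved to the mean of the points inside its window, and points whose iterates settle on (nearly) the same mode form one cluster. The bandwidth and the mode-merging tolerance are illustrative choices; the paper's kernel and parameters are not given on the slides.

```python
import numpy as np

def mean_shift(X: np.ndarray, bandwidth: float = 0.5, n_iter: int = 50):
    """Flat-kernel mean shift followed by merging of coincident modes."""
    modes = X.astype(float)
    for _ in range(n_iter):
        for i in range(len(modes)):
            in_window = np.linalg.norm(X - modes[i], axis=1) < bandwidth
            modes[i] = X[in_window].mean(axis=0)   # shift to the local mean
    # Points whose modes coincide (up to a tolerance) get the same cluster label.
    labels = -np.ones(len(X), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # one tight cluster near (0, 0)
               rng.normal(3.0, 0.1, (20, 2))])  # another near (3, 3)
labels, centers = mean_shift(X)
print(len(centers))  # 2
```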


45 Performance Study of DECK Description of the data set: the real-life Web click-through data collected by AOL from March to May 2006. Real events are used in the experiments. We then randomly select query sessions which represent a real event or a non-real event to generate five data sets, which respectively contain 5K, 10K, 20K, 50K and 100K query sessions.

46 Performance Study of DECK Result Analysis : We define :

47 Performance Study of DECK Experimental results, compared methods:
- DECK: our algorithm.
- 2PClustering: the first known algorithm for event detection from click-through data.
- DECK-GPCA: a variant that employs the original GPCA to estimate subspaces (to evaluate the contribution of KNN-GPCA subspace estimation).
- DECK-NP: DECK with no pruning (to evaluate the contribution of subspace pruning).

48 Performance Study of DECK Experimental results: we further evaluate the performance using the entropy measure. DECK achieves better results: since 2PClustering does not prune any data, its entropy is higher than that of DECK.

49 Conclusion DECK ROCKS Questions?