
Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining Summarized by Dongmin Shin Presented by Dongmin Shin User Log Analysis.


1 Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining
28th VLDB Conference (2002)
Yusuke Ohura, Katsumi Takahashi, Iko Pramudiono, Masaru Kitsuregawa
Institute of Industrial Science, University of Tokyo; NTT Information Sharing Platform Laboratories
Summarized and presented by Dongmin Shin, User Log Analysis Team, IDS Lab., SNU, 2008.08.14

2 Index
- Introduction
- Internet Yellow Page Service and its Problems
- Log Analysis of iTOWNPAGE
- Query Expansion Using Web Log Mining
- Implementation and Evaluation
- Conclusion
Copyright © 2008 by CEBT, Center for E-Business Technology

3 Introduction
- Rapid progress in storage capacity and processor performance
  - Creates an opportunity to analyze the huge log data left on Web servers
  - However, no technical report on mining such huge log data is publicly available
- This paper reports
  - Results of log data mining and query expansion experiments on a huge commercial Web service
  - iTOWNPAGE: an online Japanese telephone directory system (Yellow Page service)

5 Internet Yellow Page Service and its Problems
- iTOWNPAGE: the Internet version of TOWNPAGE

6 Internet Yellow Page Service and its Problems
- Problems found through statistical analysis
  - Log data
    - Access log of iTOWNPAGE from 1 February to 30 June 2000
    - 450 million lines, 200 GB
  - First issue: sessions with multiple categories
    - 27.2% of search sessions that take a category as their variable input are multiple-category sessions
    - 75.2% of them used non-sibling categories, that is, categories that do not share a parent in the category hierarchy iTOWNPAGE provides
  - Second issue: cases where users get no results for their search requests
[Figure: Overview of Search Requests on iTOWNPAGE]
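The two percentages on this slide are ratios over sessions, which can be illustrated with a small sketch. The category names, the `PARENT` hierarchy, and the rule that a session counts as non-sibling when its categories have more than one distinct parent are all assumptions made for illustration; the slide does not give the exact definition used in the paper.

```python
# Hypothetical two-level category hierarchy: category -> parent.
PARENT = {
    "Hotels": "Lodging",
    "Business Hotels": "Lodging",
    "Rent-a-car": "Transport",
    "Sushi": "Restaurants",
}

def is_multi_category(session):
    """A session is multi-category if it searched more than one distinct category."""
    return len(set(session)) > 1

def has_non_sibling_categories(session):
    """Assumed rule: non-sibling if the categories have more than one parent."""
    return len({PARENT[c] for c in set(session)}) > 1

# Toy sessions, each a list of the categories a user searched.
sessions = [
    ["Hotels"],
    ["Hotels", "Business Hotels"],
    ["Hotels", "Rent-a-car"],
    ["Sushi", "Hotels", "Rent-a-car"],
]

multi = [s for s in sessions if is_multi_category(s)]
non_sibling = [s for s in multi if has_non_sibling_categories(s)]

multi_ratio = len(multi) / len(sessions)       # share of multi-category sessions
non_sibling_ratio = len(non_sibling) / len(multi)  # share of those using non-siblings
```

On the toy data, three of four sessions are multi-category and two of those three mix categories from different parents.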

8 Log Analysis of iTOWNPAGE
- Session
  - The sequence of requests from one user
  - Two consecutive requests made within a 30-minute interval are regarded as part of the same session
- Session vector
  - The i-th session vector (formula not preserved in this transcript)
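The session rule above can be sketched as follows. The request tuple layout `(user, timestamp, category)`, the inclusive 30-minute comparison, and the representation of a session vector as per-category request counts are assumptions; the slide's own session vector formula is not available here.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # interval stated on the slide

def sessionize(requests):
    """Group (user, timestamp, category) requests into per-user sessions.

    Consecutive requests from the same user stay in one session when
    they are at most SESSION_GAP apart (inclusive bound is an assumption).
    """
    sessions = []
    last_seen = {}  # user -> (timestamp of last request, index into sessions)
    for user, ts, category in sorted(requests, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is not None and ts - prev[0] <= SESSION_GAP:
            idx = prev[1]
            sessions[idx].append(category)
        else:
            sessions.append([category])
            idx = len(sessions) - 1
        last_seen[user] = (ts, idx)
    return sessions

def session_vector(session, vocabulary):
    """Assumed representation: request count per category, in vocabulary order."""
    counts = Counter(session)
    return [counts[c] for c in vocabulary]

t0 = datetime(2000, 2, 1, 9, 0)
requests = [
    ("u1", t0, "Hotels"),
    ("u1", t0 + timedelta(minutes=10), "Rent-a-car"),
    ("u1", t0 + timedelta(minutes=50), "Sushi"),  # 40 minutes later: new session
    ("u2", t0, "Hotels"),
]
sessions = sessionize(requests)
vector = session_vector(sessions[0], ["Hotels", "Rent-a-car", "Sushi"])
```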

9 Log Analysis of iTOWNPAGE
- K-means clustering algorithm for clustering sessions
  - The number of clusters cannot be predicted in advance
  - The algorithm is therefore improved so that it decides the number of clusters dynamically
- Improved K-means algorithm
  - The first input vector becomes the centroid vector of the first cluster C_1 and a member of C_1
  - For each successive input vector, the similarity with the existing clusters C_1 ... C_k is calculated (formula not preserved in this transcript)
    - If the similarity is below the threshold for every existing cluster, a new cluster is generated
    - If not, the input vector becomes a member of the cluster with the highest similarity
  - The centroid vector is recalculated (formula not preserved in this transcript)
  - The process is executed iteratively until it converges
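A minimal sketch of the threshold-based variant described on this slide is given below. Because the similarity and centroid formulas are missing from the transcript, cosine similarity and the arithmetic mean are used as stand-ins, and centroids are recomputed once per pass rather than after every assignment. The function name `threshold_kmeans` and its parameters are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity (an assumed stand-in for the slide's similarity formula)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Arithmetic mean of the member vectors (assumed centroid formula)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def threshold_kmeans(vectors, threshold, max_iter=20):
    """Cluster vectors without fixing the number of clusters in advance."""
    centroids = []
    assignment = None
    for _ in range(max_iter):
        new_assignment = []
        for v in vectors:
            sims = [cosine(v, c) for c in centroids]
            if not sims or max(sims) < threshold:
                # No existing cluster is similar enough: open a new one.
                centroids.append(list(v))
                new_assignment.append(len(centroids) - 1)
            else:
                new_assignment.append(sims.index(max(sims)))
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, new_assignment) if a == k]
            if members:
                centroids[k] = mean_vector(members)
        if new_assignment == assignment:
            break  # assignments are stable: converged
        assignment = new_assignment
    return assignment, centroids

labels, centroids = threshold_kmeans(
    [[1, 0, 0], [2, 0, 0], [0, 1, 0], [0, 1, 1]], threshold=0.5
)
```

With a threshold of 0.5 the four toy session vectors fall into two clusters, and the number of clusters is never passed in.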

10 K-means Clustering Algorithm
- Overview
  - An algorithm that clusters n objects into k partitions based on their attributes
  - It assumes that the object attributes form a vector space
  - Its objective is to minimize the total intra-cluster variance
- Algorithm
  - Partition the input points into k initial sets, either at random or using some heuristic
  - Calculate the centroid (mean point) of each set
  - Construct a new partition by associating each point with the closest centroid
  - Recalculate the centroids for the new clusters, and repeat these two steps alternately until convergence
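The standard algorithm summarized above can be sketched in plain Python. This version initializes by sampling k input points as centroids, which is one common variant of the "random initial sets" step; the function and parameter names are illustrative.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = mean_point(members)
    return labels, centroids

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(points, k=2)
```

Unlike the threshold-based variant, k must be supplied up front, which is the limitation the improved algorithm is meant to remove.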

11 Log Analysis of iTOWNPAGE
- Only categories whose number of sessions exceeds TH_cat of the cluster's total sessions are displayed
- Many categories that are non-siblings in the category hierarchy appear in the clusters
- This suggests that search sessions with the same input, such as “Hotels”, are performed under various demands and contexts
- Clustering Web access logs is effective for understanding user behavior
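The display filter can be sketched as follows, treating TH_cat as a fraction of the sessions in a cluster and counting each category at most once per session. Both choices, and the function name `major_categories`, are assumptions.

```python
from collections import Counter

def major_categories(cluster_sessions, th_cat):
    """Categories appearing in more than th_cat (a fraction) of the cluster's sessions."""
    total = len(cluster_sessions)
    if total == 0:
        return []
    # Count each category once per session, not once per request.
    counts = Counter(c for s in cluster_sessions for c in set(s))
    return sorted(c for c, n in counts.items() if n / total > th_cat)

cluster = [
    ["Hotels", "Rent-a-car"],
    ["Hotels"],
    ["Hotels", "Rent-a-car"],
    ["Hotels", "Sushi"],
]
shown = major_categories(cluster, th_cat=0.3)
```

Here "Sushi" appears in only a quarter of the sessions and is dropped, while "Rent-a-car", a non-sibling of "Hotels" in the toy data, is kept.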

13 Query Expansion Using Web Log Mining
- Motivation
  - Many requests end with no result
    - Best solution: recommend another address
      - Possible only when coordinate information for addresses is available
    - Another solution: recommend categories
      - Requires a similarity between categories, which can be extracted by clustering the user access log
  - Many sessions consist of non-sibling categories
    - Propose another expansion method that recommends categories that are not similar to the input category but are related to it

14 Query Expansion Using Web Log Mining
- Strategies for query expansion
  - Intra-category recommendation: selects sibling categories that appear in the major clusters of CAT_input
    1. Find the clusters that have CAT_input as a member, in order of the appearance ratio of CAT_input
    2. From each cluster, choose the sibling category with the highest count, until the number of sibling categories reaches MAX_sibl
  - Inter-category recommendation: selects non-sibling categories that appear in the major clusters of CAT_input
    1. From each cluster, choose the non-sibling category of CAT_input with the highest count, up to MAX_non-sibl, in the same way as in the intra-category steps
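Both strategies follow the same loop and differ only in whether siblings or non-siblings are picked, so one hedged sketch can cover them. Clusters are modeled here as dictionaries of category counts, the appearance ratio is taken as the input category's share of a cluster's counts, and `PARENT`, `recommend`, and the sample data are all hypothetical.

```python
# Hypothetical hierarchy: category -> parent.
PARENT = {
    "Hotels": "Lodging",
    "Business Hotels": "Lodging",
    "Inns": "Lodging",
    "Rent-a-car": "Transport",
    "Sushi": "Restaurants",
}

def recommend(cat_input, clusters, max_n, want_siblings):
    """Pick at most one category per cluster, visiting clusters by appearance ratio."""
    def is_sibling(c):
        return PARENT[c] == PARENT[cat_input]

    def appearance_ratio(cluster):
        return cluster[cat_input] / sum(cluster.values())

    ranked = sorted(
        (c for c in clusters if cat_input in c),
        key=appearance_ratio,
        reverse=True,
    )
    picks = []
    for cluster in ranked:
        if len(picks) >= max_n:
            break
        candidates = [
            c for c in cluster
            if c != cat_input and c not in picks and is_sibling(c) == want_siblings
        ]
        if candidates:
            # Take the candidate with the highest count in this cluster.
            picks.append(max(candidates, key=lambda c: cluster[c]))
    return picks

clusters = [
    {"Hotels": 50, "Business Hotels": 30, "Rent-a-car": 20},
    {"Hotels": 20, "Inns": 10, "Sushi": 70},
    {"Sushi": 40, "Rent-a-car": 5},
]
intra = recommend("Hotels", clusters, max_n=2, want_siblings=True)
inter = recommend("Hotels", clusters, max_n=2, want_siblings=False)
```

`max_n` plays the role of MAX_sibl in the intra-category call and MAX_non-sibl in the inter-category call.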

16 Implementation and Evaluation

17 Implementation and Evaluation
- A separate log is used to test the expansion method: 1 July to 20 July 2000
- First, the test data is converted into sessions
  - “Category A -> Category B -> Category C”
- Transition relations are then extracted from the sessions
  - “Category A -> Category B”, “Category B -> Category C”
- Features
  - N: number of test relations
  - S: number of successful expansions in the expansion test
  - C_i: number of expanded categories displayed for the i-th test request
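The extraction of transition relations and the counting of N, S, and the C_i values can be sketched as below. Treating an expansion as successful when the next category of the relation appears among the expanded categories is an assumption, as are the `expand` callback and the toy data; the slide's own evaluation formulas are not preserved.

```python
def transitions(session):
    """Split 'A -> B -> C' into the relations (A, B) and (B, C)."""
    return list(zip(session, session[1:]))

def evaluate(test_sessions, expand):
    """Return (N, S, sum of C_i) over all transition relations.

    expand(category) is a hypothetical function returning the expanded
    categories shown for that input category.
    """
    n = s = shown = 0
    for session in test_sessions:
        for src, dst in transitions(session):
            expanded = expand(src)
            n += 1
            shown += len(expanded)
            if dst in expanded:  # assumed definition of a successful expansion
                s += 1
    return n, s, shown

# Toy expansion table standing in for the mined recommendations.
expansions = {"A": ["B"], "B": ["C", "D"]}
test_sessions = [["A", "B", "C"], ["A", "C"]]
n, s, shown = evaluate(test_sessions, lambda c: expansions.get(c, []))
```

From these counts one could derive summaries such as S / N or the mean number of categories shown per request, though whether the paper uses exactly those ratios cannot be confirmed from the transcript.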

19 Conclusion
- Reports experimental results of mining the access log of a huge commercial site
- Proposes a query expansion method based on clustering of user requests
- Enhances the K-means clustering algorithm
- Two-step expansion method
  - Recommendation of similar categories
  - Recommendation of related categories that are not similar in the category hierarchy

20 Summary
- Pros
  - Uses real data from a commercial Web site
  - Simple and useful
- Cons
  - Nothing special: clustering of user sessions
  - Two-step expansion method?

