25/03/2003CSCI 6405 Zheyuan Yu1 Finding Unexpected Information Taken from the paper : “Discovering Unexpected Information from your Competitor’s Web Sites”

Slides:

Advertisements

Similar presentations

Advertisements

Information Retrieval (IR) on the Internet. Contents  Definition of IR  Performance Indicators of IR systems  Basics of an IR system  Some IR Techniques.

Chapter 5: Introduction to Information Retrieval

Exercising these ideas  You have a description of each item in a small collection. (30 web sites)  Assume we are looking for information about boxers,

Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.

© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.

April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:

Aki Hecht Seminar in Databases (236826) January 2009

Searching The Web Search Engines are computer programs (variously called robots, crawlers, spiders, worms) that automatically visit Web sites and, starting.

1 Chapter 19: Information Retrieval. ©Silberschatz, Korth and Sudarshan19.2Database System Concepts - 5 th Edition, Sep 2, 2005 Chapter 19: Information.

Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.

Compare&Contrast: Using the Web to Discover Comparable Cases for News Stories Presenter: Aravind Krishna Kalavagattu.

Chapter 19: Information Retrieval

1 Discovering Unexpected Information from Your Competitor’s Web Sites Bing Liu, Yiming Ma, Philip S. Yu Héctor A. Villa Martínez.

Information Retrieval

Chapter 5: Information Retrieval and Web Search

Overview of Search Engines

WEB SCIENCE: SEARCHING THE WEB. Basic Terms Search engine Software that finds information on the Internet or World Wide Web Web crawler An automated program.

 Search engines are programs that search documents for specified keywords and returns a list of the documents where the keywords were found.  A search.

Temporal Event Map Construction For Event Search Qing Li Department of Computer Science City University of Hong Kong.

Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica.

«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,

Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.

Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.

1 Chapter 19: Information Retrieval Chapter 19: Information Retrieval Relevance Ranking Using Terms Relevance Using Hyperlinks Synonyms., Homonyms,

Computing & Information Sciences Kansas State University Monday, 04 Dec 2006CIS 560: Database System Concepts Lecture 41 of 42 Monday, 04 December 2006.

©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.

Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source:

1 Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing Seung-Taek Park and David M. Pennock (ACM SIGKDD 2007)

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

Automatically Extracting Data Records from Web Pages Presenter: Dheerendranath Mundluru

UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.

Presented by: Apeksha Khabia Guided by: Dr. M. B. Chandak

1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:

When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.

Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.

April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.

Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.

Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining

Giorgos Giannopoulos (IMIS/”Athena” R.C and NTU Athens, Greece) Theodore Dalamagas (IMIS/”Athena” R.C., Greece) Timos Sellis (IMIS/”Athena” R.C and NTU.

Chapter 6: Information Retrieval and Web Search

Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.

Presenter: Shanshan Lu 03/04/2010

Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Search(initialLink) fringe.Insert( initialLink ); loop.

How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”

Vector Space Models.

Data Structures Using C++ 2E

INTEGRATING BROWSING AND SEARCHING WebGlimpse and ScentTrails -Rajesh Golla.

Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.

The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.

© Prentice Hall1 DATA MINING Web Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides.

1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.

Database System Concepts, 5th Ed. ©Sang Ho Lee Chapter 19: Information Retrieval.

Crawling When the Google visit your website for the purpose of tracking, Google does this with help of machine, known as web crawler, spider, Google bot,

Search Engine Optimization

Automated Information Retrieval

Queensland University of Technology

Clustering of Web pages

SEARCH ENGINES & WEB CRAWLER Akshay Ghadge Roll No: 107.

Information Retrieval

IST 516 Fall 2011 Dongwon Lee, Ph.D.

Information Retrieval

Data Mining Chapter 6 Search Engines

Introduction to Information Retrieval

Chapter 5: Information Retrieval and Web Search

Chapter 31: Information Retrieval

Information Retrieval and Web Design

Information Retrieval and Web Design

Chapter 19: Information Retrieval

VECTOR SPACE MODEL Its Applications and implementations

Presentation transcript:

25/03/2003CSCI 6405 Zheyuan Yu1 Finding Unexpected Information Taken from the paper : “Discovering Unexpected Information from your Competitor’s Web Sites” by Bing Liu, Yiming Ma, and Philip S. Yu Presented by Zheyuan Yu

25/03/2003CSCI 6405 Zheyuan Yu2 What is ‘Unexpected Information’ ? Relevant but unknown Contradicts user’s existing beliefs or expectations E.g. A company wants to know what it does not know about competitors

25/03/2003CSCI 6405 Zheyuan Yu3 Existing Extraction Methods Manual Browsing Search Engine – user-specified keywords Web query – languages (SQL) search through info. resources (XML) User preference approach – info. given according to set preference categories

25/03/2003CSCI 6405 Zheyuan Yu4 Problems with Existing Methods Only information expected by or already known to user is returned User cannot search for something he doesn’t know he is looking for Manual examination takes too long

25/03/2003CSCI 6405 Zheyuan Yu5 Proposed approach Aim: Finding interesting/unexpected information To find what is unexpected, we need to know what the user has known? It becomes a problem of comparing user’s website with competitor’s website to find similar and different information.

25/03/2003CSCI 6405 Zheyuan Yu6 How to represent the page’s information Documents and Queries are represented as vectors. Position 1 corresponds to term 1, position 2 to term 2, position t to term t

25/03/2003CSCI 6405 Zheyuan Yu7 Weight tf x idf measure: term frequency (tf) inverse document frequency (idf)

25/03/2003CSCI 6405 Zheyuan Yu8 How to calculate the similarity

25/03/2003CSCI 6405 Zheyuan Yu9 Compare Two Web Sites (1): Similar Pages Goal – find pages in the competitor site that closely match a page in the user site Method – given a u j (user page) in U (user web site), for all c i (competitor page) in C (competitor web site) compute: (u j dot c i ) / (|u j | cross |c i |) Then rank pages in descending order

25/03/2003CSCI 6405 Zheyuan Yu10 Compare Two Web Sites (2): Unexpected Terms Goal – find unexpected terms in a competitor page relative to a user page Method – given a u j in U and a c i in C, find unexp. term k r by computing: unexpT rji = { 1–(tf rj / tf ri ), if (tf rj / tf ri )<= 1 { 0, otherwise Then rank the k terms in descending order

25/03/2003CSCI 6405 Zheyuan Yu11 Compare Two Web Sites (3): Unexpected Pages Goal – find unexpected pages in the competitor site relative to the user site Method – combine all the pages in U to form a single document and all the pages in C to form another single document This is necessary because information on a topic can be contained entirely on one page or spread through many, as web site structures vary

25/03/2003CSCI 6405 Zheyuan Yu12 Compare Two Web Sites (4) Unexpected Concepts Goal – find unexpected concepts in a competitor page relative to a user page More meaningful than keywords Less information for user to look at Method – first use association rule algorithm (Apriori used – next slide) to discover concepts Each page is mined separately because concepts tend to be page based Unexpected term comparison is then done with concepts in place of keywords

25/03/2003CSCI 6405 Zheyuan Yu13 Unexpected Concepts – Apriori Algorithm Keywords in each sentence are a transaction The set of all sentences is a dataset Treat concepts as terms, using method 2 to find unexpected concepts Support = count( k 1 U k 2 ) Confidence = count( k 1 U k 2 ) / count (k 1 ) Candidates pruned based on sup. & con.

25/03/2003CSCI 6405 Zheyuan Yu14 Compare Tow Web Sites (5): Outgoing Links Goal – Find all outgoing links in C that are not in U Method – Links are simply collected by the crawler when it initially explores the U and C sites

25/03/2003CSCI 6405 Zheyuan Yu15 System Screenshot

25/03/2003CSCI 6405 Zheyuan Yu16 Summary of Use User selects a topic of interest, identifies a page of his own that deals with the topic. User then can find pages in a competitor’s site that deal with the same topic, giving the user an idea of the quantity and location of these pages (method 1) User can scan these pages for unexpected information (method 2, method3)

25/03/2003CSCI 6405 Zheyuan Yu17 Summary of Use (cont’d) User can then manually browse similar pages with interesting unexpected information User can find unexpected pages based on concepts (method 4) User can examine unexpected outgoing links for more information or to add the links to his own pages (method 5) Experiments include comparison for travel company, private education institution and diving company. Many piece of unexpected information discovered.

25/03/2003CSCI 6405 Zheyuan Yu18 Time Complexity: Linear in the number of pages. ‘Web Crawling’, one-time, is O(N) where n is number/size of pages ‘Extraction and Mining’, one-time, is O(K 2 N), where K is number of keywords ‘Corresponding Page’ is O(T C N C +N u N C ), where N C is number/size of pages in C, T C is maximum amount of terms in any page in C and N u is size of the page in U (weighting time + similarity computation) ‘Unexpected Terms’ is O(T c ), where T c is the amount of terms in the page in C

25/03/2003CSCI 6405 Zheyuan Yu19 Time Complexity (cont’d): Linear in the number of pages. ‘Unexpected Pages’ is O(T U N U +T C N C ), where T U is maximum terms in a U page and N U is number/size of pages in U, and T C & N C have similar meanings for C (time for merging - unexpP i is T C N C ) ‘Unexpected Concepts’ is O(Co c ), where Co c is the amount of concepts in the page in C ‘Unexpected Links’ is O(L c ) where L c is the amount of links in C Assuming size (or # of keywords) on an average page is constant, then all comparison algorithms are basically linear in the number of pages involved

25/03/2003CSCI 6405 Zheyuan Yu20 Efficiency Experiments run on a PII 350 PC w/ 64MB RAM All computations can be done efficiently Unexpected pages can be found for a 50 page competitor site in about a second ProcessSimilar Page Unexp. Terms Unexp. Pages Assoc. Mining Avg. Time (ms)

25/03/2003CSCI 6405 Zheyuan Yu21 Future Application Research tool to find related topics Shopping comparison between 2 sites

25/03/2003CSCI 6405 Zheyuan Yu22 Summary Unexpected information is interesting Proposed a number of methods Techniques proposed are practical and efficient

25/03/2003CSCI 6405 Zheyuan Yu23 References Liu, Bing,Yiming Ma, Philip S. Yu. Discovering Unexpected Information from Your Competitor’s Web Sites. Proceedings of The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), August 26-29, 2001, San Francisco, USA.

25/03/2003CSCI 6405 Zheyuan Yu24 Thank you! Any questions?