Discovering Unexpected Information from Your Competitor’s Web Sites Bing Liu, Yiming Ma, Philip S. Yu Héctor A. Villa Martínez

Presentation transcript:

1 Discovering Unexpected Information from Your Competitor’s Web Sites Bing Liu, Yiming Ma, Philip S. Yu Héctor A. Villa Martínez

2 Objective of this article The authors present a system that helps find unexpected information in a web site.

3 Searching for information on the web Many methods: Keyword based (e.g. Google, Yahoo). Wrapper based (e.g. extracting prices). Web query languages (e.g. extensions of SQL). User preference based (the user specifies categories).

4 Searching for information on the web Main drawbacks: These methods only find anticipated information. Unexpected information is hard to find.

5 What is unexpected anyway? A piece of information is unexpected if: it is relevant but unknown, or it contradicts existing beliefs or expectations. Relevant here means interesting to the user, which is subjective.

6 Summary of the approach U: user web site E: knowledge about the competitor C: competitor web site Compare C vs. U and E to find unexpected information in C.

7 How to compare two web pages Use the vector space representation: Define a set of p keywords (index terms) $K = \{k_1, k_2, \dots, k_p\}$. Represent a document D using a vector $D = \{w_1, w_2, \dots, w_p\}$, where $w_i$ is the weight of keyword $k_i$: $w_i > 0$ if $k_i$ appears in D, and $w_i = 0$ otherwise.

8 Vector space representation Example: K = {night, day, empire, barbarians, people, house} D = [“Because night is here but the barbarians have not come. And some people arrived from the borders, and said that there are no longer any barbarians. And now what shall become of us without any barbarians? Those people were some kind of solution.”] D = {1, 0, 0, 3, 2, 0} or normalized to: D = {1/6, 0, 0, 3/6, 2/6, 0}
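
The example above can be reproduced with a short sketch. It assumes the weight of a keyword is simply its occurrence count, as the slide's numbers suggest; the function name `keyword_vector` and the lowercase, letters-only tokenization are illustrative choices, not part of the original system.

```python
import re
from collections import Counter


def keyword_vector(text, keywords, normalize=False):
    """Return one weight per keyword: its occurrence count in text.

    With normalize=True the counts are divided by their sum, so the
    weights add up to 1 (or stay all zero if no keyword occurs).
    """
    # Assumption: tokens are maximal runs of letters, compared in lowercase.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    weights = [counts[k] for k in keywords]
    if normalize:
        total = sum(weights)
        weights = [w / total if total else 0.0 for w in weights]
    return weights


KEYWORDS = ["night", "day", "empire", "barbarians", "people", "house"]
TEXT = (
    "Because night is here but the barbarians have not come. "
    "And some people arrived from the borders, and said that there are "
    "no longer any barbarians. And now what shall become of us without "
    "any barbarians? Those people were some kind of solution."
)

raw = keyword_vector(TEXT, KEYWORDS)
normalized = keyword_vector(TEXT, KEYWORDS, normalize=True)
print(raw)         # counts for night, day, empire, barbarians, people, house
print(normalized)  # the same counts divided by their sum
```

A real system would probably also remove stop words and stem the tokens before counting; that is left out here to keep the sketch short.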

9 Comparing two web pages Given two web pages in vector space representation, $D = \{d_1, d_2, \dots, d_n\}$ and $Q = \{q_1, q_2, \dots, q_n\}$, the cosine of the angle between them gives a measure of similarity: $\mathrm{sim}(D, Q) = (D \cdot Q) / (|D| \, |Q|)$

10 Comparing two web pages Example: P = {0.3, 0.0, 0.0, 0.7}, Q = {0.5, 0.0, 0.1, 0.4}, R = {0.0, 0.5, 0.5, 0.0}. $\mathrm{sim}(P, P) = (P \cdot P) / (|P| \, |P|) = 1.0$, $\mathrm{sim}(P, Q) = (P \cdot Q) / (|P| \, |Q|) = 0.87$, $\mathrm{sim}(P, R) = (P \cdot R) / (|P| \, |R|) = 0$
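
The three similarity values can be checked with a minimal sketch of the cosine measure. The function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    # Assumption: an all-zero vector is treated as similar to nothing.
    return dot / norm if norm else 0.0


P = [0.3, 0.0, 0.0, 0.7]
Q = [0.5, 0.0, 0.1, 0.4]
R = [0.0, 0.5, 0.5, 0.0]

print(round(cosine_similarity(P, P), 2))
print(round(cosine_similarity(P, Q), 2))
print(round(cosine_similarity(P, R), 2))
```

P and R share no keyword with a non-zero weight, which is why their dot product, and hence their similarity, is zero.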

11 Methods to find unexpected information in a site Let $U = (u_1, \dots, u_m)$ be the user web site and $C = (c_1, \dots, c_n)$ be the competitor web site: 1. Find the corresponding C page(s) of a U page. 2. Find unexpected terms in a C page. 3. Find unexpected pages in C. 4. Find unexpected concepts in a C page. 5. Find unexpected outgoing links.

12 1. Find the corresponding C page(s) of a U page Given a page $u_i$ in U: Compare $u_i$ with each page in C. Sort the results in descending order of similarity.

13 1. Find the corresponding C page(s) of a U page Example: Select $u_1$. Find $\mathrm{sim}(u_1, c_1), \mathrm{sim}(u_1, c_2), \dots, \mathrm{sim}(u_1, c_n)$. Sort the results in decreasing order: say $c_4, c_2, c_8$, etc. Complexity: $O(G|C| + |u_i||C|)$, where G is the maximum number of terms in a page $c_j$.
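
Task 1 can be sketched as a ranking by cosine similarity. The page names and vectors below are invented for illustration, and `rank_corresponding_pages` is a hypothetical helper, not the authors' code.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_corresponding_pages(u_page, c_pages):
    """Score every competitor page against u_page, most similar first.

    c_pages maps a page name to its keyword-weight vector.
    """
    scored = [(name, cosine_similarity(u_page, vec)) for name, vec in c_pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical data: one user page and three competitor pages.
u1 = [0.3, 0.0, 0.0, 0.7]
competitor = {
    "c1": [0.0, 0.5, 0.5, 0.0],
    "c2": [0.5, 0.0, 0.1, 0.4],
    "c3": [0.3, 0.0, 0.0, 0.7],
}

ranking = rank_corresponding_pages(u1, competitor)
for name, score in ranking:
    print(name, round(score, 2))
```

The loop visits each competitor page once, which matches the linear dependence on |C| in the stated complexity.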

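14 2. Find unexpected terms in a C page Given $u_j$ and $c_i$, measure the unexpectedness of each term $t_r$, where $f_{rj}$ and $f_{ri}$ are the weights of $t_r$ in $u_j$ and $c_i$ respectively: $$\mathrm{unexpT}_{rij} = \begin{cases} 1 - f_{rj} / f_{ri} & \text{if } f_{rj} / f_{ri} \le 1 \\ 0 & \text{otherwise} \end{cases}$$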

15 2. Find unexpected terms in a C page Example: keywords = {data, predict, classify, state}, $u_j$ = {0.4, 0.5, 0.0, 0.1}, $c_i$ = {0.3, 0.3, 0.2, 0.2}, unexpT = {0, 0, 1, 0.5}. Complexity: O(Z), where Z is the number of terms in $c_i$.
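
The example can be checked with a small sketch of the term measure. The slide does not say what happens when a term is absent from the competitor page (a zero denominator); returning 0 in that case is an assumption, as is the function name `unexpected_terms`.

```python
def unexpected_terms(u_page, c_page):
    """Unexpectedness of each term of c_page with respect to u_page.

    Both arguments are weight vectors over the same keyword list.
    """
    scores = []
    for f_u, f_c in zip(u_page, c_page):
        if f_c == 0:
            # Assumption: a term missing from the competitor page
            # cannot be unexpected there, so it scores 0.
            scores.append(0.0)
            continue
        ratio = f_u / f_c
        scores.append(1 - ratio if ratio <= 1 else 0.0)
    return scores


keywords = ["data", "predict", "classify", "state"]
u_j = [0.4, 0.5, 0.0, 0.1]
c_i = [0.3, 0.3, 0.2, 0.2]

unexp = unexpected_terms(u_j, c_i)
for term, score in zip(keywords, unexp):
    print(term, score)
```

A term scores 1 when only the competitor uses it ("classify") and 0 when the user site uses it at least as heavily as the competitor ("data", "predict").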

16 3. Find unexpected pages in C 1. Combine all pages of U into a single document $D_u$. 2. Combine all pages of C into a single document $D_c$. 3. Compute the unexpectedness of each term $k_r$ in $D_c$ with respect to $D_u$ (Task 2). 4. The unexpectedness of a page $c_i$ is the sum of the unexpectedness of its terms, divided by m, the number of terms in the page: $\mathrm{unexpP}_i = \left(\sum_r \mathrm{unexpT}_{rcu}\right) / m$

17 3. Find unexpected pages in C Complexity: $O(M_u|U| + M_c|C|)$, where $M_u$ is the maximum number of terms in a U page and $M_c$ is the maximum number of terms in a C page.
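
Task 3 can be sketched as follows. The sketch assumes pages are given as term-count dictionaries, that the combined documents use relative frequencies as weights, and that m is the number of distinct terms in the page; the helper names and the sample data are hypothetical.

```python
from collections import Counter


def combine(pages):
    """Merge term-count dicts into one document of relative frequencies."""
    total = Counter()
    for page in pages:
        total.update(page)
    n = sum(total.values())
    return {term: count / n for term, count in total.items()} if n else {}


def term_unexpectedness(f_u, f_c):
    """Task 2 measure; a zero competitor weight scores 0 (assumption)."""
    if f_c == 0:
        return 0.0
    ratio = f_u / f_c
    return 1 - ratio if ratio <= 1 else 0.0


def unexpected_pages(u_pages, c_pages):
    """Return (page index, score) pairs for C, most unexpected first."""
    d_u = combine(u_pages)
    d_c = combine(c_pages)
    unexp = {t: term_unexpectedness(d_u.get(t, 0.0), f) for t, f in d_c.items()}
    scores = []
    for index, page in enumerate(c_pages):
        terms = [t for t, count in page.items() if count > 0]
        score = sum(unexp[t] for t in terms) / len(terms) if terms else 0.0
        scores.append((index, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical sites: one user page, two competitor pages.
U_PAGES = [{"data": 4, "predict": 4}]
C_PAGES = [{"data": 2, "predict": 2}, {"classify": 2, "data": 2}]

ranking = unexpected_pages(U_PAGES, C_PAGES)
print(ranking)
```

Each page of U and C is scanned once to build the combined documents, and each C page once more to score it, in line with the linear complexity stated above.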

18 4. Find unexpected concepts in a C page A concept is a set of keywords that occur together and express the same idea. Example: “information extraction”, “extraction of information”, and “information is extracted” express the same idea “information extraction”

19 4. Find unexpected concepts in a C page Algorithm: 1. Divide the page into sentences. 2. Use the Apriori algorithm (Agrawal & Srikant) to find association rules of the form $X \to Y$ with confidence c, where $X, Y \subseteq K$ (the set of keywords) and c is user defined. These association rules are the concepts present in the page.

20 4. Find unexpected concepts in a C page 3. Treating each concept as a term, proceed as Task 2, finding unexpected terms in C.
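
The concept-finding step can be illustrated with a deliberately simplified sketch. It is not the full Apriori algorithm: it only considers rules between single keywords (X and Y of size one) and counts co-occurrence within a sentence. The thresholds, the sentence splitting, and the function name `find_concepts` are all assumptions.

```python
import re
from itertools import permutations


def find_concepts(text, keywords, min_support=2, min_confidence=0.6):
    """Find rules x -> y between keywords that co-occur in sentences.

    Returns (x, y, confidence) triples. Support is the number of
    sentences containing both x and y; confidence is that count
    divided by the number of sentences containing x.
    """
    keyword_set = set(keywords)
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    # Each sentence becomes a "transaction": the keywords it contains.
    transactions = [set(re.findall(r"[a-z]+", s)) & keyword_set for s in sentences]

    rules = []
    for x, y in permutations(sorted(keyword_set), 2):
        count_x = sum(1 for t in transactions if x in t)
        count_xy = sum(1 for t in transactions if x in t and y in t)
        if count_x and count_xy >= min_support and count_xy / count_x >= min_confidence:
            rules.append((x, y, count_xy / count_x))
    return rules


# Hypothetical page text.
PAGE = (
    "Information extraction is hard. We study extraction of information. "
    "Information is everywhere. Extraction tools help."
)

concepts = find_concepts(PAGE, ["information", "extraction"])
for x, y, confidence in concepts:
    print(f"{x} -> {y} (confidence {confidence:.2f})")
```

Each rule found this way would then be treated as a single term and scored with the Task 2 measure. Stemming, so that "extracted" and "extraction" match, is omitted here.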

21 5. Find unexpected outgoing links Let $L_u$ be the set of outgoing links from U. Let $L_c$ be the set of outgoing links from C. $\mathrm{unexpL} = L_c - L_u$ (set difference).
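
The link comparison is a plain set difference, sketched below. The URLs are placeholders, and the sketch assumes links are compared as exact strings; a real crawler would likely normalize them first.

```python
def unexpected_links(user_links, competitor_links):
    """Outgoing links found on the competitor site but not on the user site."""
    return set(competitor_links) - set(user_links)


# Hypothetical outgoing links of each site.
L_u = {"http://example.org/kdd", "http://example.org/tools"}
L_c = {"http://example.org/kdd", "http://example.org/partners"}

unexp_links = unexpected_links(L_u, L_c)
print(sorted(unexp_links))
```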

22 Incorporating user knowledge Let E be the user's knowledge about the competitor. E is specified as: Keyword terms. Concepts. Links.

23 Incorporating user knowledge The elements in E are incorporated in tasks 2 through 5 to find unexpected terms, pages, concepts, and links. Elements in E are ranked low in unexpectedness.

24 System architecture Implemented in C++/Win32. A spider (crawler): collects information. Keyword extractor and concept finder. Comparison component: performs tasks 1-5. User interface.

25 A running example The authors compare their own site with SGI’s MineSet data mining site, without supplying any extra knowledge.

26 Results Found documentation pages on the SGI site. The authors now plan to add their own. Found previously unknown pages describing MineSet technology. Found some previously unknown MineSet features. Found many interesting terms, concepts, and links.

27 Evaluation The system was further tested with three different organizations: Travel company Private school Diving company

28 Evaluation The users reported that the system helped them in: Focusing on interesting pages, terms, and concepts. Making a more complete analysis of the competitor’s site. Not missing important information. Finding unexpected things.

29 Efficiency If the number of keywords is constant, the algorithms are linear in the number of pages. Tested on a Pentium II 350 PC with 64MB of RAM. [Table: running times of sim, unexpT_rij, unexpP_j, and association mining per site. Sites and number of pages (#pag): [1] (143), [2] (21), [3] (66), [4] (127), [5] (46).]

30 Future work Use of metadata. Study how links can be used to infer more unexpected information. Monitor the site, reporting any unexpected change.

31 Intrinsic limitations Text oriented. Does not work with images. Can have problems with tables. Does not work with dynamic web sites.
