Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval and Web Search

Presentation transcript:

Slide 1: Part I: Web Structure Mining
Chapter 1: Information Retrieval and Web Search
The Web Challenges
Crawling the Web
Indexing and Keyword Search
Evaluating Search Quality
Similarity Search

Slide 2: The Web Challenges
Tim Berners-Lee, Information Management: A Proposal, CERN, March 1989.

Slide 3: The Web Challenges
18 years later …
The Web is huge and grows incredibly fast. About ten years after Tim Berners-Lee's proposal, the Web was estimated at 150 million nodes (pages) and 1.7 billion edges (links). Now it includes more than 4 billion pages, with about a million added every day.
Restricted formal semantics: nodes are just web pages and links are of a single type (e.g. "refer to"). The meaning of the nodes and links is not part of the web system; it is left to web page developers to describe in the page content what their documents mean and how they relate to the documents they link to.
As there is no central authority and there are no editors, the relevance, popularity, or authority of web pages is hard to evaluate. Links are also very diverse, and many have nothing to do with content or authority (e.g. navigation links).

Slide 4: The Web Challenges
How to turn web data into web knowledge
Use the existing Web
–Web Search Engines
–Topic Directories
Change the Web
–Semantic Web

Slide 5: Crawling the Web
To make Web search efficient, search engines collect web documents and index them by the words (terms) they contain.
For the purposes of indexing, web pages are first collected and stored in a local repository.
Web crawlers (also called spiders or robots) are programs that systematically and exhaustively browse the Web and store all visited pages.
Crawlers follow the hyperlinks in Web documents, implementing graph search algorithms such as depth-first and breadth-first search.

Slide 6: Crawling the Web
Depth-first Web crawling limited to depth 3

Slide 7: Crawling the Web
Breadth-first Web crawling limited to depth 3
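The two depth-limited traversals named on slides 6 and 7 can be sketched on a toy link graph. The graph `LINKS`, the page names, and the function names are hypothetical stand-ins for fetching real pages; this is an illustration of the search order only, not the crawler used in the book.

```python
from collections import deque

# Hypothetical toy link graph standing in for the Web: page -> outgoing links.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": ["G"],
    "E": [],
    "F": ["A"],  # a cycle back to the start page
    "G": ["H"],  # H sits at depth 4, beyond the limit used below
    "H": [],
}

def crawl_bfs(start, max_depth=3):
    """Breadth-first: visit pages level by level, stopping at max_depth."""
    visited, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for nxt in LINKS.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

def crawl_dfs(start, max_depth=3):
    """Depth-first: follow the first link as deep as allowed, then backtrack."""
    visited, order = set(), []
    stack = [(start, 0)]
    while stack:
        page, depth = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        if depth == max_depth:
            continue
        # Push links in reverse so the first link is explored first.
        for nxt in reversed(LINKS.get(page, [])):
            stack.append((nxt, depth + 1))
    return order
```

On this graph both crawls reach the same seven pages and skip `H`, but in different orders: breadth-first finishes each level before going deeper, while depth-first reaches `G` before it ever sees `C`.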

Slide 8: Crawling the Web
Issues in Web crawling:
Network latency (multithreading)
Address resolution (DNS caching)
Extracting URLs (use canonical form)
Managing a huge web page repository
Updating indices
Responding to a constantly changing Web
Interaction of Web page developers
Advanced crawling by guided (informed) search (using web page ranks)
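As one illustration of the "use canonical form" item, a crawler might normalize every extracted link before checking whether it has already been visited. The sketch below uses Python's standard `urllib.parse`; the function name `canonicalize` and the particular rules (resolve relative links, lowercase scheme and host, drop default ports and fragments) are assumptions chosen for the example, not rules stated on the slide.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(base_url, href):
    """Return one possible canonical form of a link found on base_url."""
    absolute = urljoin(base_url, href)       # resolve relative links
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()    # hostname is already lowercased
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"                 # an empty path means the root
    # Drop the fragment: it names a position inside the same page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With these rules, `canonicalize("http://Example.COM:80/a/b.html", "../c.html#top")` yields `http://example.com/c.html`, so differently written links to the same page collapse to one repository key.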

Slide 9: Indexing and Keyword Search
We need efficient content-based access to Web documents.
Document representation:
–Term-document matrix (inverted index)
Relevance ranking:
–Vector space model

Slide 10: Indexing and Keyword Search
Creating the term-document matrix (inverted index):
Documents are tokenized (punctuation marks are removed and character strings without spaces are treated as tokens).
All characters are converted to upper or to lower case.
Words are reduced to their canonical form (stemming).
Stopwords (a, an, the, on, in, at, etc.) are removed.
The remaining words, now called terms, are used as features (attributes) in the term-document matrix.
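The preprocessing steps above can be sketched as a small pipeline. The stopword list, the two sample documents, and all function names are assumptions made for the example; stemming is left as a pluggable identity function because a real stemmer (for instance Porter's) is beyond a short sketch.

```python
import re
from collections import Counter

# Hypothetical stopword list; real systems use a much longer one.
STOPWORDS = {"a", "an", "the", "on", "in", "at"}

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def terms(text, stem=lambda word: word):
    """Tokenize, stem (identity by default), then drop stopwords."""
    stemmed = [stem(token) for token in tokenize(text)]
    return [t for t in stemmed if t not in STOPWORDS]

def term_document_matrix(docs):
    """Return (vocabulary, {doc_id: row of term counts})."""
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = {doc_id: [c[t] for t in vocab] for doc_id, c in counts.items()}
    return vocab, matrix

# Hypothetical sample documents.
DOCS = {
    "d1": "The lab offers programming.",
    "d2": "A computer program, in the lab!",
}
```

Calling `term_document_matrix(DOCS)` gives the vocabulary `computer, lab, offers, program, programming` and one count row per document; replacing each nonzero count with 1 gives the Boolean form of the matrix.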

Slide 11: CCSU Departments example
Document statistics
Document ID   Document name       Words   Terms
d1            Anthropology        114     86
d2            Art                 (missing)
d3            Biology             123     91
d4            Chemistry           87      58
d5            Communication       124     88
d6            Computer Science    101     77
d7            Criminal Justice    85      60
d8            Economics           107     76
d9            English             116     80
d10           Geography           95      68
d11           History             108     78
d12           Mathematics         89      66
d13           Modern Languages    110     75
d14           Music               137     91
d15           Philosophy          85      54
d16           Physics             (missing)
d17           Political Science   120     86
d18           Psychology          96      60
d19           Sociology           99      66
d20           Theatre             116     80
Total number of words/terms: (missing)
Number of different words/terms: 744 / 671

Slide 12: CCSU Departments example
Boolean (binary) term-document matrix
Columns: DID, lab, laboratory, programming, computer, program. Rows: d1 to d20 (0/1 cell values not preserved).

Slide 13: CCSU Departments example
Term-document matrix with positions
DID   lab    laboratory   programming   computer              program
d1    0      0            0             0                     [71]
d2    0      0            0             0                     [7]
d3    0      [65,69]      0             [68]                  0
d4    0      0            0             [26]                  [30,43]
d6    0      0            [40,42]       [1,3,7,13,26,34]      [11,18,61]
d7    0      0            0             0                     [9,42]
d8    0      0            0             0                     [57]
d14   [42]   0            0             [41]                  [71]
Rows d5, d9 to d11, d13, d18, and d20 are not preserved. Rows d12 ([17], 0), d15 ([37,38]), d16 ([81]), d17 ([68]), and d19 ([51]) survive only as fragments whose columns cannot be determined.
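Storing positions rather than plain counts is what makes phrase queries possible. The sketch below builds a positional index and checks adjacency; `INDEX` holds a subset of the readable rows of the slide's matrix, while the function names and the 1-based position convention are assumptions.

```python
from collections import defaultdict

def positional_index(docs):
    """Map term -> {doc_id: [positions]} from already-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens, start=1):  # assumed 1-based
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, first, second):
    """Documents in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits

# Subset of the rows that are readable in the slide's matrix.
INDEX = {
    "lab": {"d14": [42]},
    "laboratory": {"d3": [65, 69]},
    "programming": {"d6": [40, 42]},
    "computer": {
        "d3": [68],
        "d4": [26],
        "d6": [1, 3, 7, 13, 26, 34],
        "d14": [41],
    },
}
```

On this data, `phrase_match(INDEX, "computer", "lab")` returns `["d14"]` (positions 41 and 42) and `phrase_match(INDEX, "computer", "laboratory")` returns `["d3"]` (positions 68 and 69).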

Slide 14: Vector Space Model
Documents $d_1, d_2, \ldots, d_n$; terms $t_1, t_2, \ldots, t_m$; term $t_i$ occurs $n_{ij}$ times in document $d_j$.
Boolean representation: $\vec{d}_j = (d_j^1, d_j^2, \ldots, d_j^m)$, where $d_j^i = 0$ if $n_{ij} = 0$ and $d_j^i = 1$ if $n_{ij} > 0$.
For example, if the terms are lab, laboratory, programming, computer, and program, then the Computer Science document $d_6$ is represented by the Boolean vector (0, 0, 1, 1, 1).

Slide 15: Term Frequency (TF) representation
Document vector $\vec{d}_j$ with components $d_j^i = TF(t_i, d_j)$. In each variant $TF(t_i, d_j) = 0$ if $n_{ij} = 0$; otherwise:
Using the sum of term counts: $TF(t_i, d_j) = n_{ij} / \sum_{k=1}^{m} n_{kj}$
Using the maximum of term counts: $TF(t_i, d_j) = n_{ij} / \max_k n_{kj}$
Cornell SMART system: $TF(t_i, d_j) = 1 + \log(1 + \log n_{ij})$
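The three TF variants can be compared on one count vector. The counts below are the lengths of the position lists for the Computer Science document (d6) on slide 13; restricting the sums to these five terms, the use of natural logarithms, and the function names are assumptions made for the sketch.

```python
import math

# Occurrences of (lab, laboratory, programming, computer, program) in d6,
# taken from the lengths of the position lists on slide 13.
D6_COUNTS = [0, 0, 2, 6, 3]

def tf_sum(counts):
    """Each count divided by the sum of all counts in the document."""
    total = sum(counts)
    return [n / total if total else 0.0 for n in counts]

def tf_max(counts):
    """Each count divided by the largest count in the document."""
    largest = max(counts, default=0)
    return [n / largest if largest else 0.0 for n in counts]

def tf_smart(counts):
    """Doubly damped count: 1 + log(1 + log n) for n > 0, else 0."""
    return [1 + math.log(1 + math.log(n)) if n > 0 else 0.0 for n in counts]
```

All three keep zero counts at zero and preserve the ordering of the nonzero terms; they differ in how strongly repeated occurrences are rewarded, with the doubly logarithmic variant growing the slowest.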

Slide 16: Inverse Document Frequency (IDF)
Document collection: $D$; documents that contain term $t_i$: $D_{t_i} = \{d_j \mid n_{ij} > 0\}$
Simple fraction: $IDF(t_i) = |D| / |D_{t_i}|$
or, using a log function: $IDF(t_i) = \log\frac{1 + |D|}{|D_{t_i}|}$
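A short sketch of the two IDF variants follows. The function names, the natural logarithm, and the document frequencies in the usage note are assumptions; the collection size of 20 matches the CCSU example.

```python
import math

def idf_simple(n_docs, doc_freq):
    """Collection size divided by the number of documents with the term."""
    return n_docs / doc_freq

def idf_log(n_docs, doc_freq):
    """Log-damped variant: log((1 + |D|) / |D_t|)."""
    return math.log((1 + n_docs) / doc_freq)
```

For a 20-document collection, a term found in a single document gets `idf_log(20, 1)`, about 3.04, while a term found in every document gets `idf_log(20, 20)`, about 0.05, so rare terms weigh far more than ubiquitous ones.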

Slide 17: TFIDF representation
For example, scaling the Computer Science TF vector by the IDF of each term gives a vector over the terms lab, laboratory, programming, computer, program (component values not preserved).

Slide 18: Relevance Ranking
Represent the query as a vector: q = {computer, program}, over the terms lab, laboratory, programming, computer, program (component values not preserved).
Apply IDF to its components.
Rank documents using the Euclidean norm of the vector difference or cosine similarity (equivalent to the dot product for normalized vectors).
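Ranking by cosine similarity and by Euclidean distance can be sketched as follows. The document vectors and their ids are hypothetical weights over (lab, laboratory, programming, computer, program), not the slide's lost TFIDF values, and the function names are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(u, v):
    """Dot product of the normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def distance(u, v):
    """Euclidean norm of the difference of the normalized vectors."""
    nu, nv = normalize(u), normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(nu, nv)))

def rank(query, docs):
    """Document ids ordered from most to least similar to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

# Hypothetical weights over (lab, laboratory, programming, computer, program).
DOCS = {
    "x": [0.0, 0.0, 0.0, 1.0, 1.0],
    "y": [0.0, 0.0, 2.0, 1.0, 0.0],
    "z": [1.0, 1.0, 0.0, 0.0, 0.0],
}
QUERY = [0.0, 0.0, 0.0, 1.0, 1.0]
```

`rank(QUERY, DOCS)` orders the documents as `x`, `y`, `z`. Sorting by ascending `distance` gives the same order, because both measures are computed on unit vectors.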

Slide 19: Relevance Ranking
Cosine similarities and distances to the normalized query vector. Columns: Doc, TFIDF coordinates (normalized), cosine similarity (rank), distance (rank). Only the three top-ranked rows retain readable values:
Doc   Cosine similarity (rank)   Distance (rank)
d4    (value missing) (1)        0.298 (1)
d6    (value missing) (2)        0.603 (2)
d14   (value missing) (3)        1.043 (3)
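Both rankings on this slide agree, which is expected: for unit vectors u and v, ||u - v||^2 = 2 - 2 (u · v), so a smaller distance always means a larger cosine. The sketch below applies that identity to the three distances that remain readable. The resulting cosines are derived values that assume both vectors were normalized exactly; they are not figures recovered from the slide.

```python
def cosine_from_distance(dist):
    """Cosine implied by the Euclidean distance between two unit vectors."""
    return 1.0 - dist * dist / 2.0

# Distances to the query that remain readable on slide 19.
DISTANCES = {"d4": 0.298, "d6": 0.603, "d14": 1.043}

IMPLIED_COSINES = {doc: cosine_from_distance(d) for doc, d in DISTANCES.items()}
# Approximately 0.956 for d4, 0.818 for d6, and 0.456 for d14.
```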

Slide 20: Relevance Feedback
The user provides feedback:
Relevant documents (D+)
Irrelevant documents (D-)
The original query vector is updated (Rocchio's method).
Pseudo-relevance feedback:
The top 10 documents returned by the original query belong to D+
The rest of the documents belong to D-
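The update formula itself is missing from the transcript. A commonly cited form of Rocchio's method is q' = alpha q + beta (sum of vectors in D+) - gamma (sum of vectors in D-); some presentations use the mean of each set instead of the sum. The sketch below assumes the sum form, and the default weights are illustrative assumptions, not values from the slide.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward relevant vectors and away from irrelevant ones.

    Assumed form: alpha * q + beta * sum(D+) - gamma * sum(D-).
    """
    updated = [alpha * q for q in query]
    for doc in relevant:
        updated = [u + beta * x for u, x in zip(updated, doc)]
    for doc in irrelevant:
        updated = [u - gamma * x for u, x in zip(updated, doc)]
    return updated
```

For example, `rocchio([1.0, 0.0], [[0.0, 1.0]], [[1.0, 1.0]], alpha=1.0, beta=0.5, gamma=0.25)` returns `[0.75, 0.25]`: the second term gains weight because it occurs in the relevant document. For pseudo-relevance feedback, the same function would be called with the top-ranked results as `relevant`.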

Slide 21: Advanced text search
Using "OR" or "NOT" Boolean operators
Phrase search
–Statistical methods to extract phrases from text
–Indexing phrases
Part-of-speech tagging
Approximate string matching (using n-grams)
–Example: match "program" and "prorgam": {pr, ro, og, gr, ra, am} ∩ {pr, ro, or, rg, ga, am} = {pr, ro, am}
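The bigram example can be reproduced in a few lines. The function names are assumptions, and the Jaccard ratio is one possible way to turn the overlap into a score; the slide shows only the intersection.

```python
def ngrams(word, n=2):
    """Set of all contiguous character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_overlap(a, b, n=2):
    """n-grams shared by the two words."""
    return ngrams(a, n) & ngrams(b, n)

def ngram_similarity(a, b, n=2):
    """Jaccard ratio of the n-gram sets (an assumed scoring choice)."""
    union = ngrams(a, n) | ngrams(b, n)
    return len(ngram_overlap(a, b, n)) / len(union) if union else 0.0
```

`ngram_overlap("program", "prorgam")` returns the set containing `pr`, `ro`, and `am`, matching the slide. The two words share 3 of 9 distinct bigrams, so the assumed similarity score is 1/3.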

Slide 22: Using the HTML structure in keyword search
Titles and metatags
–Use them as tags in indexing
–Modify ranking depending on the context where the term occurs
Headings and font modifiers (prone to spam)
Anchor text
–Plays an important role in web page indexing and search
–Allows search indices to be extended with pages that have never been crawled
–Allows non-textual content (such as images and programs) to be indexed

Slide 23: Evaluating search quality
Assume that there is a set of queries $Q$ and a set of documents $D$, and that for each query $q \in Q$ submitted to the system we have:
–The response set of documents (retrieved documents) $R_q \subseteq D$
–The set of relevant documents $D_q$, selected manually from the whole collection of documents, i.e. $D_q \subseteq D$

Slide 24: Precision-recall framework (set-valued)
Determine the relationship between the set of relevant documents ($D_q$) and the set of retrieved documents ($R_q$).
Ideally $D_q = R_q$.
Generally $D_q \neq R_q$; the two sets only partly overlap.
A very general query leads to recall = 1, but low precision.
A very restrictive query leads to precision = 1, but low recall.
A good balance is needed to maximize both precision and recall.
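The set-valued measures follow the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below assumes those definitions, since the slide's own formulas are not preserved; the function name and document ids are illustrative.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query from two collections of ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments for one query over a 20-document collection.
RELEVANT = {"d4", "d6", "d14"}
ALL_DOCS = {f"d{i}" for i in range(1, 21)}
```

Retrieving the whole collection, `precision_recall(RELEVANT, ALL_DOCS)`, gives recall 1.0 but precision 0.15. Retrieving only `{"d4"}` gives precision 1.0 but recall 1/3. These are the two extremes described above.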

Slide 25: Precision-recall framework (using ranks)
With thousands of documents, finding the full set of relevant documents $D_q$ is practically impossible. So, let's consider a list $R_q = (d_1, d_2, \ldots, d_n)$ of ranked documents (highest rank first).
For each $d_i \in R_q$ compute its relevance as $r_i = 1$ if $d_i \in D_q$, and $r_i = 0$ otherwise.
Define precision at rank $k$ as $precision(k) = \frac{1}{k} \sum_{i=1}^{k} r_i$
Define recall at rank $k$ as $recall(k) = \frac{1}{|D_q|} \sum_{i=1}^{k} r_i$
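The rank-based measures can be sketched with a hypothetical ranked list. The function names, the ranked list, and the relevance judgments are assumptions; the definitions are the usual ones, counting relevant documents among the top k and dividing by k for precision or by the number of relevant documents for recall.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

# Hypothetical ranked result list and relevance judgments.
RANKED = ["d4", "d1", "d6", "d2", "d14"]
RELEVANT = {"d4", "d6", "d14"}
```

Here precision at rank 3 and recall at rank 3 are both 2/3. Moving to rank 5 raises recall to 1.0 while precision falls to 0.6, which shows the trade-off as more of the list is examined.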