Web Indexing
ICE0534 – Web-based Software Development
July 21, 2005
Seonah Lee
Contents
- News related to Web Indexing
- Web Indexing?
- Web Indexing: Styles
- Web Indexing: Tools
- Web Indexing in Search Engine
- Web Indexing in Google
- Summary
- References
- Question
"Google tests tool to aid Web indexing", by Dawn Kawamoto, CNET News.com, Monday, June 6, 2005, 12:00 AM
Web Indexing?
- Creating indexes for:
  - individual web sites
  - intranets
  - collections of HTML documents
  - collections of web sites
- Purpose: to help users find information using a variety of keywords, and to gather similar information together
Web Indexing?
- Indexes
  - Systematically arranged items
  - Entry points that lead directly to the desired information within a larger document or set of documents
- Indexing
  - An analytic process of determining which concepts are worth indexing, what entry labels to use, and how to arrange the entries
Web Indexing: Styles (1/2)
- Back-of-the-Book Style Web Indexing
  - Includes “A-Z indexes” to web sites or an intranet
  - Some web indexes take the form of a list of hierarchical categories arranged in alphabetical order
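As a minimal sketch of the A-Z style described above, the following groups index entries by their initial letter and sorts them alphabetically. The entry labels, the URLs, and the function name `build_az_index` are hypothetical and chosen only for illustration. Dedicated tools such as HTML Indexer do considerably more.

```python
from collections import defaultdict


def build_az_index(entries):
    """Group (label, url) index entries by initial letter, sorted A-Z."""
    groups = defaultdict(list)
    for label, url in entries:
        groups[label[0].upper()].append((label, url))
    # Sort the letters, then sort the entries under each letter ignoring case.
    return {
        letter: sorted(items, key=lambda item: item[0].lower())
        for letter, items in sorted(groups.items())
    }


# Hypothetical entries for illustration only.
entries = [
    ("PageRank", "/google.html#pagerank"),
    ("metadata", "/styles.html#metadata"),
    ("HITS", "/google.html#hits"),
    ("hubs", "/google.html#hubs"),
]

az_index = build_az_index(entries)
for letter, items in az_index.items():
    print(letter)
    for label, url in items:
        print(f"  {label}  ->  {url}")
```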
Web Indexing: Styles (2/2)
- Metadata and Web Indexing
  - Assigning keywords or phrases to web pages or web sites within a meta-tag field, so that the page or site can be retrieved by a search engine customized to search the keywords field
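To illustrate the metadata style, here is a small sketch that reads the keywords meta-tag field of a page using Python's standard `html.parser` module. The sample page and the names `MetaKeywordParser` and `extract_keywords` are assumptions made for this example. It is not how any particular search engine reads the field.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated terms of <meta name="keywords" ...> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]


def extract_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical page for illustration only.
page = (
    '<html><head><meta name="keywords" '
    'content="Web indexing, metadata, Search Engines"></head>'
    "<body></body></html>"
)
print(extract_keywords(page))
```

A search engine customized for this field would then match queries against these keywords rather than against the full page text.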
Web Indexing: Tools

Category | Description | Tools
--- | --- | ---
Standalone or dedicated tools | Usually used for back-of-the-book indexes | HTML Indexer, XRefHT32
Embedded indexing | The process of creating index entries electronically in a document's files | FrameMaker, Microsoft Word
Tagging | Inserts numbered dummy tags in the files, and then builds the index separately | In-house tools
Keywording | Used primarily in online help materials | RoboHelp
HTML utilities and add-ons | Convert an ASCII index file to HTML documents | HTML/Prep
Text searching tools | An aspect of information retrieval that indexers are very interested in | SWISH
Web Indexing: The Most Famous Tool
- HTML Indexer, by Brown Inc.
- http://www.html-indexer.com/index.html
Web Indexing in Search Engine
- Phases of work of a Web search engine (SE)
  1. Document gathering
  2. Document indexing
  3. Searching in response to a query
  4. Visualization of search results
- (Diagram labels: Parse, Query, Gathering, Indexing, Rank or Match, The Web, Visualization)
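The four phases can be sketched in a few lines, assuming the gathering phase has already produced an in-memory mapping from URL to page text. The documents, URLs, and function names below are hypothetical, and the matching step is a plain boolean AND over an inverted index rather than any real engine's ranking.

```python
import re
from collections import defaultdict

# Gathering (assumed already done): hypothetical URL -> fetched text.
documents = {
    "a.html": "Web indexing helps users find information",
    "b.html": "Search engines build an index of the web",
    "c.html": "PageRank uses links between pages",
}


def tokenize(text):
    """Parse text or a query into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: map each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Searching: return URLs containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


index = build_index(documents)
results = search(index, "web index")

# Visualization: a numbered result list.
for position, url in enumerate(results, start=1):
    print(position, url)
```

Note that this sketch does no stemming, so the query term "index" does not match "indexing".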
Web Indexing in Search Engine
- Almost every Web search engine uses a slightly different technique
  - The parsing discards some HTML markup
  - Some give different weights to terms in different HTML fields
  - Some do not index the full text of the document, but only part of it
  - Some make full use of “metadata”
  - Very few make use of the information provided by linking: HITS and PageRank (Google)
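As a hedged illustration of giving different weights to terms in different HTML fields, the sketch below discards the markup and scores each term by the field it appears in. The weights in `FIELD_WEIGHTS` and the class name `WeightedTermParser` are assumptions for this example. Real engines choose their own fields and weights.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights for illustration; text in any other field counts 1.0.
FIELD_WEIGHTS = {"title": 3.0, "h1": 2.0}
DEFAULT_WEIGHT = 1.0


class WeightedTermParser(HTMLParser):
    """Drop the tags, keep the text, and weight terms by enclosing field."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weights = [FIELD_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags]
        weight = max(weights or [DEFAULT_WEIGHT])
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[term] += weight


# Hypothetical page for illustration only.
page = (
    "<html><head><title>Web Indexing</title></head>"
    "<body><h1>Indexing styles</h1><p>Metadata indexing</p></body></html>"
)
parser = WeightedTermParser()
parser.feed(page)
scores = dict(parser.scores)
print(scores)
```

Here "indexing" scores highest because it appears in the title, the heading, and the body text.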
Web Indexing in Google
- PageRank
  - Google assigns a number called the PageRank to every web page that it knows about
  - Assumption: a page is important if other important web pages link to it
  - Each page = node; directed edge = a link from one page to another
- (Diagram: link graph with nodes labeled Main Page, Google, This Page, Yahoo)
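Web Indexing in Google
- PageRank: Example
  - (Diagram: three linked pages with PageRanks R1, R2, R3)
  - Equations: $R_1 = R_3$, $R_2 = R_1 / 2$, $R_3 = R_1 / 2 + R_2$
  - Assumption: an average page has a PageRank of 1, so $R_1 + R_2 + R_3 = 3$
  - Since $R_1 = 2R_2$ and $R_3 = R_1$: $3 = 2R_2 + R_2 + 2R_2 = 5R_2$
  - Result: $R_1 = 1.2$, $R_2 = 0.6$, $R_3 = 1.2$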
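The example can be checked numerically with a short sketch. The link structure below is inferred from the slide's equations (page 1 links to pages 2 and 3, page 2 links to page 3, page 3 links to page 1) and is an assumption, since the diagram itself is not preserved. This is the simplified form used on the slide, in which each page splits its rank evenly among its out-links. It omits the damping factor of the published PageRank algorithm.

```python
def pagerank(links, iterations=100):
    """Simplified PageRank by repeated substitution (no damping factor).

    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    # Start from the slide's assumption: an average page has PageRank 1.
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in links.items():
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Link structure inferred from the slide's equations (an assumption).
links = {"R1": ["R2", "R3"], "R2": ["R3"], "R3": ["R1"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Because each page passes on all of its rank, the total stays at 3 throughout, and the values settle at approximately R1 = 1.2, R2 = 0.6, R3 = 1.2, in agreement with the slide.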
Web Indexing in Google
- HITS (Hyperlink-Induced Topic Search)
  - Divides pages relating to a topic into two groups
    - Authorities: pages with good content about a topic
    - Hubs: pages that link to many authority pages on a topic (directory)
  - Iteratively calculates hub and authority scores for each page in the neighborhood and ranks results accordingly
    - A document that many pages point to is a good authority
    - A document that points to many authorities is a good hub; pointing to many good authorities makes for an even better hub
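The iterative hub and authority calculation can be sketched as follows. The page names and link graph are hypothetical, the neighborhood is assumed to be given already, and the scores are normalized after each round so they stay bounded. Treat it as an illustration of the idea rather than a reference implementation.

```python
import math


def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auth = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub score: sum of the authority scores of pages it points to.
        hub = {
            page: sum(auth[t] for t in links.get(page, []))
            for page in pages
        }
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Hypothetical neighborhood for illustration only.
links = {
    "directory": ["siteA", "siteB", "siteC"],
    "blog": ["siteA", "siteB"],
    "siteA": ["siteB"],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this graph the directory page, which points to the most authorities, ends up as the strongest hub, and the page that the most pages point to ends up as the strongest authority.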
Summary
- Web Indexing
- Web Indexing Styles
  - Back-of-the-Book Style Web Indexing
  - Metadata and Web Indexing
- Web Indexing Techniques in Google
  - HITS
  - PageRank