Natural Language Processing WEB SEARCH ENGINES August, 2002
The World Wide Web The World Wide Web is estimated to contain more than seven billion pages of publicly-accessible information. The Web continues to grow at an exponential rate: tripling in size over the past two years, according to one estimate. All this data is uncatalogued and unclassified.
Is this a Library? Definitely not! Often there are no titles, no author names, and no publication dates. There is no specific way of arranging the text: no classification or cataloguing. New data appears every day and some old data disappears.
Variability in WWW pages Documents on the web vary enormously, both internally and in the external meta information that might be available. Internally, documents differ in language, vocabulary, type, format, and so on. Meta data includes: reputation of the source, update frequency, quality, popularity or usage, and citations.
How do we search the WWW? Subject Directories: Allow the user to browse through lists of WWW sites that are hierarchically organized indexes of subject categories. Search Engines: Allow the user to enter keywords that are run against a database. General directories, subject specific directories, general search engines, multi threaded search engines, subject specific search engines, all exist.
What a Search Engine does not do A search engine does not search the whole WWW. As of Jan, 2004, Google reports its size as 3,307,998,701 pages. This is not the complete WWW. A search engine is not searching the Internet "live," as it exists at this very moment. Its database is updated every few hours, days, or even months.
How it works The three parts of a search engine: A mechanism that identifies web pages to be included in the database. A mechanism that indexes the sites. A searching mechanism with an interface, which scans for keywords within the index. At run time: Users search the index through queries. Documents in which the search terms occur are presented as "hits." The documents are listed according to some relevance criteria.
Indexing A search engine uses its index to retrieve web documents in which your search terms occur. Hit List: The index lists the term and where it occurs (the URL or address of the web page, position in the page, font, capitalization etc.) much like a book index. Every single word is included in the index!
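The hit list idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two pages and their `example.com` URLs are hypothetical, `build_index` is a name invented here, and a hit records just the URL, the word position, and whether the word was capitalized (font information is left out).

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to its hit list: (url, position, capitalized)."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[A-Za-z0-9']+", text)
        for position, word in enumerate(words):
            # Every single word is indexed, together with where it occurs.
            index[word.lower()].append((url, position, word[0].isupper()))
    return index

# Hypothetical two-page collection used only for illustration.
pages = {
    "http://example.com/a": "Search engines index the Web",
    "http://example.com/b": "An index lists every word",
}

index = build_index(pages)
print(index["index"])
```

Looking up a term is then a single dictionary access, which is why queries run against the index rather than against the pages themselves.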
Hand Indexing and Automatic Indexing Human-maintained indices (e.g., Yahoo!): Cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automatically generated indices (e.g., Google): Can return low quality matches. Can be misled.
A 'bot Also called: An intelligent agent, spider, crawler, robot, or worm. An automated device (software) which may be programmed to search for terms (data "strings") matching certain criteria. A 'bot identifies and notes the URLs of web pages to be included in the database. Another 'bot then works on the interiors of the web documents, recording occurrences of words and their position within the text. This information is used to create a huge index.
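A minimal sketch of the first kind of 'bot, the one that discovers URLs, is given below. To keep it runnable without a network, the "web" is a hypothetical in-memory dictionary (`WEB`) and `fetch` is a stand-in for an HTTP request; both are assumptions made for this example, as is the breadth-first visiting order.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.com/": ("home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/", "http://example.com/missing"]),
}

def fetch(url):
    """Stand-in for an HTTP request; returns None for unreachable pages."""
    return WEB.get(url)

def crawl(seed, limit=100):
    """Breadth-first crawl from seed; returns {url: text} for the pages found."""
    seen = {seed}
    queue = deque([seed])
    collected = {}
    while queue and len(collected) < limit:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # dead link: note nothing and move on
        text, links = page
        collected[url] = text
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return collected

found = crawl("http://example.com/")
print(list(found))
```

The collected texts are what the second 'bot would then process to record word occurrences and positions.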
Querying The query terms are treated as keywords to be found in the documents. In the second generation web search engines natural language queries are understood and then acted upon.
Relevance (Results Ranking) Relevance calculated based on how many times the search terms were found in the site. Noting where the term occurs within the text and assigning this position a "weight" or level of importance. Search terms occurring in the title, summary, in key positions within a paragraph or appearing several times within a paragraph usually carry more "weight." For multiple terms higher weights given when terms appear closer together.
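The weighting scheme above can be illustrated with a small scoring function. This is a sketch, not any engine's actual formula: the function name `relevance`, the weights `title_weight` and `proximity_weight`, and the rule "bonus = weight / smallest gap between two different query terms" are all assumptions chosen for the example.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance(query, title, body, title_weight=3.0, proximity_weight=2.0):
    """Score a page by term frequency, title hits, and term proximity."""
    terms = tokenize(query)
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for term in terms:
        score += body_words.count(term)                   # frequency in the text
        score += title_weight * title_words.count(term)   # title hits weigh more
    # Proximity: find the smallest gap between two different query terms.
    positions = [
        [i for i, word in enumerate(body_words) if word == term]
        for term in set(terms)
    ]
    present = [p for p in positions if p]
    if len(present) >= 2:
        gap = min(
            abs(i - j)
            for a in range(len(present))
            for b in range(a + 1, len(present))
            for i in present[a]
            for j in present[b]
        )
        score += proximity_weight / gap  # closer terms give a larger bonus
    return score

near = relevance("web search", "Web search engines", "a web search engine indexes pages")
far = relevance("web search", "Engines", "the web is large and people search it daily")
print(near, far)
```

Both pages contain each query term once in the body, so the difference in score comes entirely from the title hits and from how close together the terms appear.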
Relevance (Cont.) Incorporating the popularity element. Looking at how many links a web document has from other websites, and also the quality of the referring websites. Ranking according to sites other searchers have chosen from their results to similar queries.
Query Evaluation Parse the query. Scan through the documents to find those matching the queries. Rank the documents that matched the queries and present the top K.
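The three steps can be sketched as follows. The example assumes a tiny hypothetical collection (`documents`), scores a document simply by counting query-term occurrences, and scans every document directly; a real engine would consult its index instead of scanning the texts.

```python
import heapq
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def evaluate(query, documents, k=2):
    """Return the top-k (score, name) pairs for documents containing every query term."""
    terms = tokenize(query)                       # 1. parse the query
    matches = []
    for name, text in documents.items():          # 2. find the matching documents
        words = tokenize(text)
        if all(term in words for term in terms):
            score = sum(words.count(term) for term in terms)
            matches.append((score, name))
    return heapq.nlargest(k, matches)             # 3. rank and keep the top K

# Hypothetical collection used only for illustration.
documents = {
    "doc1": "search engines rank web pages",
    "doc2": "web search web crawl web index",
    "doc3": "libraries catalogue books",
    "doc4": "the web and how to search the web",
}

top = evaluate("web search", documents, k=2)
print(top)
```

Using `heapq.nlargest` avoids sorting every match when only the top K results are shown to the user.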
Examples of search engines AltaVista, Excite, FAST, Google, HotBot, Northern Light.
Evaluating a Search Engine Criteria: quality of results, coverage, scalability, efficiency in storage and retrieval, query handling speed, and interface quality and ease of use. In short: good quality results, and efficient crawling, indexing and searching.
Techniques in Searching Extend traditional IR techniques. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words.
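The short-document effect is easy to reproduce with cosine similarity over raw word counts. The sketch below assumes the simplest possible vectors (one dimension per word, no term weighting), and the query and both documents are made up for illustration.

```python
import math
from collections import Counter

def cosine(query, document):
    """Cosine similarity between word-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[word] * d[word] for word in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

query = "web search engines"
short_doc = "web search engines here"
long_doc = ("this survey explains how web search engines crawl "
            "index and rank billions of pages")

print(cosine(query, short_doc))  # the query plus one extra word
print(cosine(query, long_doc))   # a longer, more informative page
```

Both documents contain all three query words, yet the four-word page scores much higher because the longer page's extra vocabulary lengthens its vector and lowers the cosine.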
Google As of Jan, 2004, Google reports its size as 3,307,998,701 pages. Automatic indexing of pages. Implemented in C/C++. In addition to keyword locations on a page, it makes use of the link structure of the Web, both to improve search results and to calculate a quality ranking for each web page. Data structures are designed to avoid disk seeks whenever possible.
PageRank – the random surfer model $PR(A) = (1-d) + d\left(\frac{PR(t_1)}{C(t_1)} + \cdots + \frac{PR(t_n)}{C(t_n)}\right)$, where $t_1, t_2, \ldots, t_n$ are the pages linking to page A, $C(t_i)$ is the number of outgoing links of $t_i$, and $d$ is a damping parameter with $0 < d < 1$. Iterate (50 times). Rescale (logarithmically?).
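The iteration can be sketched directly from the formula. Several things here are assumptions for the sake of a runnable example: the three-page link graph is hypothetical, every rank starts at 1.0, `d = 0.85` is a value commonly quoted for PageRank, and pages without outgoing links are not given any special treatment. The final rescaling step is omitted.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum PR(t)/C(t) over every page t that links to this page.
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ends up ranked highest because it receives links from both other pages, and A comes second because it inherits all of C's rank through C's single outgoing link.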
Yahoo! Human-maintained indexes. Covers popular topics. Subjective. Expensive to build and maintain. Slow to improve. Cannot cover all esoteric topics. In Yahoo, you are searching only the title and the short descriptive blurb about the site; by contrast, search engines usually give you access to the full text of the document.
Alta Vista Offers spell check. Recognizes capitalization and proper nouns. Offers search in numerous languages. Ranks according to how many of the search terms a page contains, where in the document, and how close to one another the search terms are.
Ask Jeeves Ask Jeeves is a natural language search engine which attempts to resolve user questions into appropriate answers. Does semantic and syntactic processing of the query to understand the question. Learns from previous interactions with other users to get to popular resources. Guides (interacts with) the user into asking "useful" questions. Retrieves the sites with the best answers. It is a multi-threaded search engine.