Natural Language Processing WEB SEARCH ENGINES August, 2002.


The World Wide Web The World Wide Web is estimated to contain more than seven billion pages of publicly-accessible information. The Web continues to grow at an exponential rate: tripling in size over the past two years, according to one estimate. All this data is uncatalogued and unclassified.

Is this a Library? Definitely not! Often there are no titles, author names, or publication dates. There is no specific way of arranging the text: no classification or cataloguing. New data appears every day and some old data disappears.

Variability in WWW pages Documents on the web vary widely, both internally and in the external meta information that might be available. Documents differ internally in their language, vocabulary, type or format, etc. Meta data includes: reputation of the source, update frequency, quality, popularity or usage, and citations.

How do we search the WWW? Subject Directories: Allow the user to browse lists of WWW sites organized hierarchically by subject category. Search Engines: Allow the user to enter keywords that are run against a database. General directories, subject-specific directories, general search engines, multi-threaded search engines, and subject-specific search engines all exist.

What a Search Engine does not do A search engine does not search the whole WWW. As of January 2004, Google reports its size as 3,307,998,701 pages. This is not the complete WWW. A search engine is not searching the Internet "live," as it exists at this very moment. Its database is updated every few hours, days, or even months.

How it works The three parts of a search engine: A mechanism that identifies web pages to be included in the database. A mechanism that indexes the sites. A searching mechanism with an interface, which scans for keywords within the index. At run time: Users search the index through queries. Documents in which the search terms occur are presented as "hits." The documents are listed according to some relevance criteria.

Indexing A search engine uses its index to retrieve web documents in which your search terms occur. Hit List: The index lists the term and where it occurs (the URL or address of the web page, position in the page, font, capitalization etc.) much like a book index. Every single word is included in the index!
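The hit list described above can be sketched as a small inverted index. This is a minimal illustration, not how any particular engine stores its index: the function name build_index, the example.org URLs, and the choice to record only the URL and word position (leaving out font and capitalization) are assumptions made for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Map every word to a hit list of (url, position) pairs."""
    index = defaultdict(list)
    for url, text in pages.items():
        # Every single word is indexed, together with where it occurs.
        for position, word in enumerate(text.lower().split()):
            index[word].append((url, position))
    return index


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "http://example.org/a": "web search engines index the web",
    "http://example.org/b": "a book index lists terms",
}
index = build_index(pages)
print(index["index"])
```

Looking up a search term is then a single dictionary access, which is why the index, rather than the pages themselves, is what gets searched at query time.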

Hand Indexing and Automatic Indexing Human-maintained indices (Yahoo!): Cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automatically generated indices (Google): Low-quality matches. Can be misled.

A 'bot Also called: An intelligent agent, spider, crawler, robot, or worm. An automated device (software) which may be programmed to search for terms (data "strings") matching certain criteria. A 'bot identifies and notes the URLs of web pages to be included in the database. Another 'bot then works on the interiors of the web documents, recording occurrences of words and their position within the text. This information is used to create a huge index.
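The first 'bot, the one that notes URLs, can be sketched as a breadth-first walk over links. To keep the example runnable without network access, the web is replaced by a hypothetical in-memory dictionary (TOY_WEB), and fetch_links is an assumed callback that returns the links found on a page; a real crawler would download and parse each page instead.

```python
from collections import deque


def crawl(seed, fetch_links, limit=100):
    """Visit pages breadth-first from seed and return URLs in visit order."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Note each newly discovered URL once, so it is visited only once.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical four-page web: each key links to the pages in its list.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
found = crawl("a", lambda url: TOY_WEB.get(url, []))
print(found)
```

The limit parameter is there because a crawl of the real web never finishes on its own; some stopping rule is always needed.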

Querying The query terms are treated as keywords to be found in the documents. In second-generation web search engines, natural language queries are understood and then acted upon.

Relevance (Results Ranking) Relevance is calculated based on how many times the search terms were found in the site, noting where each term occurs within the text and assigning this position a "weight" or level of importance. Search terms occurring in the title, summary, in key positions within a paragraph or appearing several times within a paragraph usually carry more "weight." For multiple terms, higher weights are given when the terms appear closer together.
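A toy scoring function shows how these weighting ideas combine. The specific numbers are assumptions chosen for illustration only: a title occurrence counts three times as much as a body occurrence (title_weight), and each pair of query terms adds a bonus that shrinks as the gap between them grows (proximity_bonus). No real engine is claimed to use these values.

```python
from itertools import combinations


def relevance(query_terms, title, body, title_weight=3.0, proximity_bonus=1.0):
    """Score a page by term frequency, title position, and term proximity."""
    terms = [t.lower() for t in query_terms]
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    positions = {}
    for term in terms:
        hits = [i for i, word in enumerate(body_words) if word == term]
        positions[term] = hits
        # One point per body occurrence, extra weight for title occurrences.
        score += len(hits) + title_weight * title_words.count(term)
    for a, b in combinations(terms, 2):
        if positions[a] and positions[b]:
            # Terms that appear closer together in the body earn a larger bonus.
            gap = min(abs(i - j) for i in positions[a] for j in positions[b])
            score += proximity_bonus / max(gap, 1)
    return score


near = relevance(["web", "search"], "Web search engines", "a web search tool")
far = relevance(["web", "search"], "Notes", "the web is big and search is hard")
print(near, far)
```

The first page scores higher on every criterion: both terms are in its title and they sit next to each other in the body.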

Relevance (Cont.) Incorporating the popularity element. Looking at how many links a web document has from other websites, and also the quality of the referring websites. Ranking according to sites other searchers have chosen from their results to similar queries.

Query Evaluation Parse the query. Scan through the documents to find those matching the queries. Rank the documents that matched the queries and present the top K.
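The three steps (parse, scan, rank the top K) can be sketched as follows. The function name evaluate, the document ids, and the scoring rule (total count of query terms, with a document matching only if it contains every term) are assumptions for the example. The sketch scans the documents directly, as the slide words it; an engine would consult its index instead.

```python
import heapq


def evaluate(query, documents, k=2):
    """Return the ids of the k best-matching documents for a keyword query."""
    terms = query.lower().split()                      # 1. parse the query
    scored = []
    for doc_id, text in documents.items():             # 2. scan for matches
        words = text.lower().split()
        if all(term in words for term in terms):
            scored.append((sum(words.count(term) for term in terms), doc_id))
    best = heapq.nlargest(k, scored)                   # 3. rank, keep the top k
    return [doc_id for _, doc_id in best]


# Hypothetical document collection.
documents = {
    "d1": "search engines rank pages",
    "d2": "search search engines and more search engines",
    "d3": "cooking recipes",
    "d4": "search tips",
}
top = evaluate("search engines", documents, k=2)
print(top)
```

Using heapq.nlargest avoids sorting every match when only the top K are shown to the user.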

Examples of search engines AltaVista, Excite, FAST, Google, HotBot, Northern Light.

Evaluating a Search Engine Quality of results. Coverage. Scalability. Efficiency in storage and retrieval. Query handling speed. Interface quality and ease. In short: good quality results, and efficient crawling, indexing and searching.

Techniques in Searching Extend traditional IR techniques. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words.
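The short-document effect is easy to reproduce with cosine similarity over raw word counts, one common way to compare vectors in the vector space model. The function name cosine and the two sample documents are assumptions made for this sketch; no term weighting such as TF-IDF is applied.

```python
import math
from collections import Counter


def cosine(query, document):
    """Cosine similarity between word-count vectors of query and document."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


# A page that is just the query plus one word, and a longer, richer page.
short = cosine("web search", "web search now")
longer = cosine("web search", "web search engines crawl index and rank pages on the web")
print(short > longer)
```

The longer page mentions both query terms but is penalized for everything else it says, so the near-empty page wins. This is the behavior the slide warns about.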

Google As of January 2004, Google reports its size as 3,307,998,701 pages. Automatic indexing of pages. Implemented in C/C++. In addition to keyword locations on a page, it makes use of the link structure of the Web to improve search results and to calculate a quality ranking for each web page. Data structures designed to avoid disk seeks whenever possible.

PageRank – the random surfer model PR(A) = (1 - d) + d (PR(t1)/C(t1) + ... + PR(tn)/C(tn)). t1, t2, ..., tn are the pages linking to page A. C(ti) is the number of outgoing links of ti. d is a damping parameter (its usual value is missing from the source). Iterate (50 times). Rescale (logarithmically?).
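The iteration can be sketched directly from the formula. Two things here are assumptions rather than content of the slide: the damping value d=0.85, which is the figure Brin and Page suggest in their paper, and the three-page link graph used as input. The sketch also starts every page at rank 1.0 and omits the final rescaling step.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page t that links here passes on PR(t) split over its C(t) links.
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

Page C ends up ranked above B because it is linked from both A and B, while B receives only half of A's rank. With this form of the formula and no pages lacking outgoing links, the ranks always sum to the number of pages.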

Yahoo! Human-maintained indexes. Covers popular topics. Subjective. Expensive to build and maintain. Slow to improve. Cannot cover all esoteric topics. In Yahoo, you are searching only the title and the short descriptive blurb about the site; by contrast, search engines usually give you access to the full text of the document.

Alta Vista Offers spell check. Recognizes capitalization and proper nouns. Offers search in numerous languages. Ranks according to how many of the search terms a page contains, where in the document, and how close to one another the search terms are.

Ask Jeeves Ask Jeeves is a natural language search engine which attempts to resolve user questions into appropriate answers. Does semantic and syntactic processing of the query to understand the question. Learns from previous interactions with other users to get to popular resources. Guides (interacts with) the user into asking "useful" questions. Retrieves the sites with the best answers. It is a multi-threaded search engine.