Steve Cassidy, Computing at Macquarie
No 1: Searching The Web
Steve Cassidy
Centre for Language Technology
Department of Computing
Macquarie University
No 2: The First Web Page
No 3: What is the Web?
Documents, text, images, sound
A web of hyperlinks
–Link one (text) document to others
Easy to join
–Any Internet user can be a publisher
Anarchic
–No-one is in charge
Very big
No 4: The Problem
Much of the information available is text-based
Text is difficult for computers to process
The popular use of computers and the Internet has increased the availability of text-based information
Information Overload
No 5: The Solution?
Only one of the top four commercial search engines finds itself
The best navigation should make it easy to find almost anything on the web (once all the data is entered)
The Web 1997
No 6: How do they work?
Two major steps
–Build an inverted index
–Match query terms in the index
Problems
–The web is very big
–Finding relevant documents
–Avoiding false hits
No 7: Inverted Index
[Figure: documents D1, D2 and D3 and the inverted index built from them]
Words appearing in the documents: computer, software, information, language, library, retrieval, filtering
Term          Documents
computer      D1 D2 D3
software      D1
information   D1 D3
language      D1 D2
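The index on this slide can be rebuilt with a few lines of Python. This is a minimal sketch, not the method of any particular search engine: the exact contents of D1, D2 and D3 are not recoverable from the slide, so the three document strings below are assumptions chosen to reproduce the term-to-document table shown. The function name `build_inverted_index` is also hypothetical.

```python
from collections import defaultdict

# Assumed document contents (not taken from the slide), chosen so that
# the resulting index matches the term -> documents table shown there.
documents = {
    "D1": "computer software information language",
    "D2": "computer language library retrieval",
    "D3": "computer information retrieval filtering",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(documents)
for term in ("computer", "software", "information", "language"):
    print(term, index[term])
```

The index is "inverted" because it goes from words to documents, whereas a document goes from its identifier to its words.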
No 8: Building the Index
List of web addresses
Download web page
Parse web page
–New links
–Web page text
Index
No 9: Building the Index
Web page text:
How Google Works
If you aren't interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price's wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).
Google consists of three distinct parts, each of which is run on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing.
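The loop on the "Building the Index" slides (take an address, download the page, parse it, index the text, queue the new links) can be sketched as follows. This is an illustration under stated assumptions, not a description of how any real crawler is implemented: the `FAKE_WEB` dictionary, its URLs and the names `PageParser` and `crawl` are all hypothetical, and the "download" step reads from memory so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" standing in for real downloads.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">to b</a> computer software',
    "http://b.example/": '<a href="http://a.example/">back</a>'
                         '<a href="http://c.example/">on</a> information retrieval',
    "http://c.example/": "language technology",
}


class PageParser(HTMLParser):
    """Collects the outgoing links and the words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed_urls, download):
    """Breadth-first crawl; returns an index of word -> set of URLs."""
    queue = deque(seed_urls)          # list of web addresses
    seen = set(seed_urls)
    index = {}
    while queue:
        url = queue.popleft()
        html = download(url)          # download web page
        if html is None:
            continue
        parser = PageParser()         # parse web page
        parser.feed(html)
        parser.close()                # flush any buffered trailing text
        for word in parser.words:     # web page text goes into the index
            index.setdefault(word, set()).add(url)
        for link in parser.links:     # new links go back onto the list
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(index["information"]))
```

The `seen` set keeps the crawler from downloading the same address twice when pages link back to each other.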
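No 10: Using the Index
[Figure: the inverted index and a query matched against it]
Term          Documents
computer      D1 D2 D3
software      D1
information   D1 D3
language      D1 D2
Query: computer software information
Documents looked up for the query terms: computer (D1 D2 D3), software (D1), information (D1 D3)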
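One simple way to match a query against the index is to look up each term and keep only the documents that appear in every list. The sketch below assumes this "all terms must match" rule, which the slide does not state explicitly; real engines may rank partial matches instead. The function name `match_all` is hypothetical, and the index is copied from the slide.

```python
# Index as shown on the slide: term -> documents containing it.
index = {
    "computer": {"D1", "D2", "D3"},
    "software": {"D1"},
    "information": {"D1", "D3"},
    "language": {"D1", "D2"},
}


def match_all(query, index):
    """Return the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    postings.sort(key=len)            # start from the shortest list
    result = set(postings[0])
    for docs in postings[1:]:
        result &= docs                # keep documents common to all terms
    return sorted(result)


print(match_all("computer software information", index))  # ['D1']
```

Starting from the shortest list keeps the intermediate result small, which matters when a common word such as "computer" appears in a very large number of documents.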
No 11: Server Farm
Over 10,000 computers
Each with a copy of the index
No 12: Relevance
Finding pages with search terms is easy
Which ones are the best?
Google:
–Text in titles, headings is important
–Text earlier in the page is important
–Text of links to this page is important
–Important pages link to other important pages
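The last point, that important pages link to other important pages, can be illustrated with a small link-analysis computation in the spirit of PageRank. This is a simplified sketch and not Google's actual ranking: the four-page link graph, the damping value of 0.85 and the function name `pagerank` are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively share each page's score among the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)
```

In this graph page C is linked to by three pages and ends up with the highest score, and page A scores well mainly because the highly ranked C links to it.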
No 13: Making the Most of Search Engines
Use words likely to appear in the pages you want
Use more query terms to narrow your result
Be brief
Don’t worry about spelling
Use “words in quotes” to search for phrases
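The difference between plain terms and a quoted phrase can be shown with a small matcher: plain terms may appear anywhere in a page, while the words of a phrase must appear next to each other and in order. This is only a sketch of the idea, with hypothetical function names, and it uses straight double quotes and simple whitespace tokenisation as simplifying assumptions.

```python
import re


def parse_query(query):
    """Split a query into quoted phrases and the remaining single terms."""
    phrases = [p.lower().split() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]*"', " ", query)
    return phrases, rest.lower().split()


def contains_phrase(tokens, phrase):
    """True if the words of the phrase occur consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def matches(text, query):
    """True if the text has every plain term and every quoted phrase."""
    tokens = text.lower().split()
    phrases, terms = parse_query(query)
    return (all(term in tokens for term in terms)
            and all(contains_phrase(tokens, phrase) for phrase in phrases))


print(matches("the semantic web exchanges machine readable data",
              '"semantic web" data'))   # True
print(matches("web pages with semantic data", '"semantic web" data'))  # False
```

The second page contains all three words, so it would match the unquoted query, but it fails the quoted one because "semantic" and "web" are not adjacent.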
No 14: Other Search Engines
–Offers ‘refine your search’
–Subject specific popularity
–Natural language questions
search.yahoo.com
No 15: The Future
Information Extraction
–Find all the details of this conference for my diary
Question Answering
–When did Armstrong land on the moon?
The Semantic Web
–Exchanging machine readable data
No 16: Language Technology
SLP148 Language, Logic and Computation
COMP248 Language Technology
COMP249 Web Technology
COMP348 Document Processing and the Semantic Web
COMP349 Spoken Language Dialogue Systems
No 17: Questions?