CP3024 Lecture 12 Search Engines

What is the main WWW problem?  With an estimated 800 million web pages finding the one you want is difficult!

What is a Search Engine?  A page on the web connected to a backend program  Allows a user to enter words which characterise a required page  Returns links to pages which match the query

A Typical Search Engine

Types of Search Engine  Automatic search engine e.g. Altavista, Lycos  Classified Directory e.g. Yahoo!  Meta-Search Engine e.g. Dogpile

Components of a Search Engine  Robot (or Worm or Spider) –collects pages –checks for page changes  Indexer –constructs a sophisticated file structure to enable fast page retrieval  Searcher –satisfies user queries
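The three components can be sketched end to end on a toy scale. This is a minimal illustration, not how any particular engine is built: the in-memory `WEB` dictionary, the page names and the function names `robot`, `indexer` and `searcher` are all assumptions made for the example, and the "sophisticated file structure" is reduced to a simple inverted index held in memory.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.html": ("search engines index pages", ["b.html", "c.html"]),
    "b.html": ("robots collect pages", ["a.html"]),
    "c.html": ("an indexer builds an index", []),
}

def robot(start):
    """Collect pages by following links breadth-first from a start URL."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def indexer(pages):
    """Build an inverted index: word -> set of URLs containing that word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def searcher(index, word):
    """Return the URLs containing the word, in sorted order."""
    return sorted(index.get(word.lower(), set()))

index = indexer(robot("a.html"))
print(searcher(index, "pages"))  # ['a.html', 'b.html']
```

The inverted index is what makes retrieval fast: answering a query is a dictionary lookup rather than a scan of every collected page.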

Query Interface  Usually a boolean interface –(Fred and Jean) or (Bill and Sam)  Normally allows phrase searches –"Fred Smith"  Also proximity searches  Not generally understood by users  May have extra 'friendlier' features
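The three query styles can be illustrated with set operations over a handful of documents. This is only a sketch under stated assumptions: the `DOCS` collection and the helper names `has`, `phrase` and `near` are invented for the example, and a real engine would answer these queries from its index rather than by rescanning text.

```python
# Hypothetical document collection: id -> text.
DOCS = {
    1: "fred and jean went to see fred smith",
    2: "bill met sam",
    3: "jean smith knows tom",
    4: "sam smith",
}

def tokens(text):
    return text.lower().split()

def has(word):
    """Ids of the documents that contain the word."""
    return {doc_id for doc_id, text in DOCS.items() if word in tokens(text)}

def phrase(text):
    """Ids of the documents that contain the words as a consecutive run."""
    want = tokens(text)
    n = len(want)
    hits = set()
    for doc_id, body in DOCS.items():
        toks = tokens(body)
        if any(toks[i:i + n] == want for i in range(len(toks) - n + 1)):
            hits.add(doc_id)
    return hits

def near(a, b, distance):
    """Ids of the documents where a and b occur within `distance` words."""
    hits = set()
    for doc_id, body in DOCS.items():
        toks = tokens(body)
        pos_a = [i for i, t in enumerate(toks) if t == a]
        pos_b = [i for i, t in enumerate(toks) if t == b]
        if any(abs(i - j) <= distance for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits

# (Fred and Jean) or (Bill and Sam): "and" is intersection, "or" is union.
boolean_hits = (has("fred") & has("jean")) | (has("bill") & has("sam"))
print(sorted(boolean_hits))          # [1, 2]
print(sorted(phrase("Fred Smith")))  # [1]
print(sorted(near("bill", "sam", 2)))  # [2]
```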

Search Results  Presented as links  Supposedly ordered in terms of relevancy to the query  Some Search Engines score results  Normally organised in groups of ten per page
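Ordering by score and presenting ten links per page might look like the following. The scores, the example URLs and the `results_page` function are assumptions for illustration only; how a given engine computes relevancy is not covered here.

```python
def results_page(scored_links, page, per_page=10):
    """Return one page of (link, score) pairs, highest score first.

    Pages are numbered from 1.
    """
    ranked = sorted(scored_links, key=lambda item: item[1], reverse=True)
    start = (page - 1) * per_page
    return ranked[start:start + per_page]

# Hypothetical result set: 25 links with made-up relevancy scores.
results = [(f"http://www.example.com/{i}.html", i / 25) for i in range(1, 26)]

for link, score in results_page(results, page=1):
    print(f"{score:.2f}  {link}")
```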

Problems  Links are often out of date  Usually too many links are returned  Returned links are not very relevant  The Engines don't know about enough pages  Different engines return different results  U.S. bias

Improving query results  To look for a particular page use an unusual phrase you know is on that page  Use phrase queries where possible  Check your spelling!  Progressively use more terms  If you don't find what you want, use another Search Engine!

Who operates Search Engines?  People who can get money from venture capitalists!  Many search engines originate from U.S. universities  Often paid for by advertisements  Engines monitor carefully what else interests you (paid by the click)

How do pages get into a Search Engine?  Robot discovery (following links)  Self submission  Payments

Robot Discovery  Robots visit sites while following links  The more links the more visits  Make sure you don't exclude Robots from visiting public pages
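Whether a robot is excluded from a page is normally decided by the site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check a made-up file; the file contents, the robot name `ExampleBot` and the example.com URLs are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot may visit everything except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

public_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html")
print(public_ok, private_ok)  # True False
```

A rule such as `Disallow: /` under `User-agent: *` would keep well-behaved robots away from the whole site, which is exactly what to avoid for pages that should be found.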

Payments  Some search engines only index paying customers  The more you pay the higher you appear on answers to queries

Self submission  Register your page with a search engine  Pay a company to register you with many search engines  Get registration with many search engines for free!

Getting to the top  Pages should rank highly only for relevant queries  Search engines only look at text  Search engine operators try to stop "search engine spamming"  Some queries are pre-answered

Get where you should be!  Put more than graphics on a page  Don't use frames  Use the tag  Make good use of and  Consider using the tag  Get people to link to your page

Summary  Search Engines are vital to the Web user  Search Engines are not perfect by a long way  There are tactics for better searching  Page design can bring more visitors via Search Engines  The more links the better!

WWLib-TNG A Next Generation Search Engine

In the beginning  WWLib-TOS –Manually constructed directory –Classified on Dewey Decimal –Simple data structure –Proof of concept

The New Architecture

The Classifier

Motive - Why Generate Metadata Automatically?  Meta tags are not compulsory  Old pages are less likely to have meta tags  Available data can be unreliable  The Web of Trust requires comprehensive resource description  An essential prerequisite for widespread deployment of RDF applications

Method - How can Metadata be Generated Automatically?  Using an automatic classifier  The classifier classifies Web Pages according to Dewey Decimal Classification  Other useful metadata can be extracted during the process of automatic classification

Automatic Classification  Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines  DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

Automatic Classifier - How does it work? Firstly, the page is retrieved from a URL or local file and parsed to produce a document object
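The parsing step could be sketched with Python's standard `html.parser`. The `Document` class and `parse_page` function are illustrative names, not the WWLib-TNG implementation, and the sketch starts from an HTML string rather than retrieving a URL or file.

```python
import re
from html.parser import HTMLParser

class Document(HTMLParser):
    """A minimal document object: the page title plus its list of words."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.words = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        # Title words are kept in the word list as well as in the title.
        self.words.extend(re.findall(r"[a-z]+", data.lower()))

def parse_page(html):
    """Parse an HTML string into a Document."""
    doc = Document()
    doc.feed(html)
    doc.close()
    return doc

page = ("<html><head><title>Search Engines</title></head>"
        "<body><p>Robots collect pages.</p></body></html>")
doc = parse_page(page)
print(doc.title)  # Search Engines
print(doc.words)  # ['search', 'engines', 'robots', 'collect', 'pages']
```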

Automatic Classifier - How does it work? The document object is then compared with DDC objects representing the ten top-level DDC classes

Automatic Classifier - How does it work?  Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score  A measure of similarity is then calculated using a similarity coefficient
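A small worked example of the scoring step follows. The slide does not say which similarity coefficient is used, so a Dice-style coefficient (matched weight divided by total weight) is assumed here purely for illustration; the word weights are made up.

```python
def match_score(doc_weights, ddc_weights):
    """Add both weights to the total for every word found in both objects."""
    return sum(doc_weights[word] + ddc_weights[word]
               for word in doc_weights if word in ddc_weights)

def similarity(doc_weights, ddc_weights):
    """Dice-style coefficient (an assumption): matched weight / total weight."""
    total = sum(doc_weights.values()) + sum(ddc_weights.values())
    return match_score(doc_weights, ddc_weights) / total if total else 0.0

# Hypothetical word weights for a document and for one DDC class.
document = {"library": 2, "catalogue": 1, "football": 1}
ddc_class = {"library": 3, "catalogue": 2, "archive": 1}

# Matches: library (2 + 3) and catalogue (1 + 2) give a score of 8.
# Total weight is 4 + 6 = 10, so the coefficient is 0.8.
print(match_score(document, ddc_class))
print(similarity(document, ddc_class))
```

With this choice the coefficient is 0 when no words match and 1 when every word in both objects matches, which makes a fixed significance threshold easy to apply.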

Automatic Classifier - How does it work?  If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class  If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark  If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
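The descent through the hierarchy can be sketched as a recursive function that prunes any branch whose similarity is not significant. The `DDCClass` structure, the word weights, the Dice-style coefficient and the threshold of 0.3 are all assumptions for the example; only the classmarks are real DDC numbers.

```python
from dataclasses import dataclass, field

@dataclass
class DDCClass:
    classmark: str
    weights: dict
    children: list = field(default_factory=list)

def similarity(doc, cls_weights):
    """Dice-style coefficient (an assumption): matched weight / total weight."""
    matched = sum(doc[w] + cls_weights[w] for w in doc if w in cls_weights)
    total = sum(doc.values()) + sum(cls_weights.values())
    return matched / total if total else 0.0

def classify(doc, classes, threshold=0.3):
    """Return the classmarks of the leaf classes reached by significant matches."""
    marks = []
    for cls in classes:
        if similarity(doc, cls.weights) < threshold:
            continue  # not significant: go no further down this branch
        if cls.children:
            marks.extend(classify(doc, cls.children, threshold))
        else:
            marks.append(cls.classmark)  # leaf node: assign the classmark
    return marks

# Hypothetical fragment of the hierarchy with made-up word weights.
TOP = [
    DDCClass("000", {"library": 1, "computer": 1, "web": 1, "information": 1}, [
        DDCClass("004", {"computer": 2, "program": 1}),
        DDCClass("020", {"library": 3, "catalogue": 2}),
    ]),
    DDCClass("500", {"science": 2, "mathematics": 1}, [
        DDCClass("510", {"mathematics": 3}),
    ]),
]

document = {"library": 2, "catalogue": 1, "web": 1}
print(classify(document, TOP))  # ['020']
```

Because class 500 fails the threshold, its subclass 510 is never compared at all, which is where the hierarchical approach saves work.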

Metadata elements  The automatic classification process can be used to extract useful metadata elements in addition to the classification classmarks:  Keywords  Classmarks  Word count  Title  URL  Abstract  A unique accession number and associated dates can be obtained and supplied by the system
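A metadata record holding these elements might be assembled as below. The extraction rules are assumptions chosen to keep the sketch short: keywords are taken as the most frequent non-stopwords, the abstract as the first few words of the text, and the accession number from a simple counter. The `Metadata` and `describe` names are illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from itertools import count

_accession_numbers = count(1)  # the system supplies unique accession numbers
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list

@dataclass
class Metadata:
    accession_number: int
    title: str
    url: str
    abstract: str
    keywords: list
    classmarks: list
    word_count: int
    classification_date: date

def describe(url, title, text, classmarks, today, n_keywords=3, abstract_words=8):
    """Build a metadata record from an already parsed and classified page."""
    words = text.lower().split()
    content_words = [w for w in words if w not in STOPWORDS]
    keywords = [w for w, _ in Counter(content_words).most_common(n_keywords)]
    return Metadata(
        accession_number=next(_accession_numbers),
        title=title,
        url=url,
        abstract=" ".join(words[:abstract_words]),
        keywords=keywords,
        classmarks=list(classmarks),
        word_count=len(words),
        classification_date=today,
    )

record = describe(
    url="http://www.example.com/library.html",  # hypothetical page
    title="Library Catalogue",
    text="the library catalogue lists the library holdings of the library",
    classmarks=["020"],
    today=date(2000, 1, 1),
)
print(record)
```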

Metadata elements - Wolverhampton Core
   Wolverhampton Core          Dublin Core
1  Unique Accession number     Identifier
2  Title
3  URL                         Identifier
4  Abstract                    Description
5  Keywords                    Subject
6  Classmarks                  Subject
7  Word count
8  Classification date
9  Last modified date          Date

RDF Data Model

RDF Schema  There is a significant overlap with the Dublin Core element set  Requirement for implementation clarity  Those that have Dublin Core equivalents are declared as sub-properties  Maintain interoperability with Dublin Core applications
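The sub-property declarations can be pictured as RDF triples. The sketch below prints them in N-Triples form using the real Dublin Core and RDF Schema namespaces; the Wolverhampton Core namespace URI and the property local names are hypothetical, and the mapping follows the Wolverhampton Core to Dublin Core correspondence given for the metadata elements.

```python
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
DC = "http://purl.org/dc/elements/1.1/"
WC = "http://example.org/wolverhampton-core#"  # hypothetical namespace URI

# Wolverhampton Core property -> Dublin Core equivalent.
# The local names on the left are illustrative, not taken from the schema.
EQUIVALENTS = {
    "accessionNumber": "identifier",
    "url": "identifier",
    "abstract": "description",
    "keyword": "subject",
    "classmark": "subject",
    "lastModified": "date",
}

def sub_property_triples():
    """Return one rdfs:subPropertyOf statement per mapped element."""
    return [
        f"<{WC}{wc_name}> <{RDFS_SUBPROPERTY_OF}> <{DC}{dc_name}> ."
        for wc_name, dc_name in EQUIVALENTS.items()
    ]

for triple in sub_property_triples():
    print(triple)
```

Declaring the elements as sub-properties is what keeps interoperability: an application that only understands Dublin Core can still treat a keyword or a classmark as a `subject`.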

RDF Schema Keyword Classmark

Classifier Evaluation  Automatic metadata generation will become important for the widespread deployment of RDF-based applications  Documents created before the invention of RDF-generating authoring tools also need to be described  RDF utilised in this manner may encourage interoperability between search engines  More info: http://www.scit.wlv.ac.uk/~ex1253/

Current Status of WWLib-TNG  New results interface proposed –R-wheel (CirSA)  Builder and searcher constructed, now being tested  Classifier constructed  Test Dispatcher/Analyser/Archiver in place
