CP3024 Lecture 12 Search Engines. What is the main WWW problem?  With an estimated 800 million web pages finding the one you want is difficult!


1 CP3024 Lecture 12 Search Engines

2 What is the main WWW problem?  With an estimated 800 million web pages, finding the one you want is difficult!

3 What is a Search Engine?  A page on the web connected to a backend program  Allows a user to enter words which characterise a required page  Returns links to pages which match the query

4 A Typical Search Engine

5 Types of Search Engine  Automatic search engine e.g. AltaVista, Lycos  Classified Directory e.g. Yahoo!  Meta-Search Engine e.g. Dogpile

6 Components of a Search Engine  Robot (or Worm or Spider) –collects pages –checks for page changes  Indexer –constructs a sophisticated file structure to enable fast page retrieval  Searcher –satisfies user queries
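The indexer and searcher components above can be illustrated with a minimal sketch of an inverted index, one common file structure for fast retrieval. The slides do not say which structure any particular engine uses, so this is an assumption; the page URLs and the function names tokenize, build_index and search are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages standing in for what a robot would have collected.
pages = {
    "http://example.org/a": "Search engines index web pages",
    "http://example.org/b": "Robots collect pages by following links",
}
index = build_index(pages)
print(search(index, "index pages"))
```

Because the index is built once, a query only has to intersect a few sets instead of rereading every page.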

7 Query Interface  Usually a boolean interface –(Fred and Jean) or (Bill and Sam)  Normally allows phrase searches –"Fred Smith"  Also proximity searches  Not generally understood by users  May have extra 'friendlier' features?
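A small sketch of how the boolean and phrase queries on this slide could be evaluated: "and" becomes set intersection, "or" becomes set union, and a phrase matches only when its words appear consecutively. The documents and helper names are made up for illustration, and a real engine would consult its index rather than rescan the text.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical documents; the keys are made-up page names.
docs = {
    "p1": "Fred Smith met Jean yesterday",
    "p2": "Bill and Sam wrote to Smith, Fred",
    "p3": "Jean and Bill went home",
}

def pages_with(word):
    """Set of pages containing the word (case-insensitive)."""
    return {name for name, text in docs.items() if word.lower() in tokenize(text)}

def pages_with_phrase(phrase):
    """Set of pages where the words of the phrase appear consecutively."""
    target = tokenize(phrase)
    hits = set()
    for name, text in docs.items():
        words = tokenize(text)
        for i in range(len(words) - len(target) + 1):
            if words[i:i + len(target)] == target:
                hits.add(name)
                break
    return hits

# (Fred and Jean) or (Bill and Sam)
boolean_hits = (pages_with("Fred") & pages_with("Jean")) | (
    pages_with("Bill") & pages_with("Sam")
)
# "Fred Smith" as a phrase: p2 contains both words but not in this order.
phrase_hits = pages_with_phrase("Fred Smith")
print(sorted(boolean_hits), sorted(phrase_hits))
```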

8 Search Results  Presented as links  Supposedly ordered in terms of relevancy to the query  Some Search Engines score results  Normally organised in groups of ten per page

9 Problems  Links are often out of date  Usually too many links are returned  Returned links are not very relevant  The Engines don't know about enough pages  Different engines return different results  U.S. bias

10 Improving query results  To look for a particular page use an unusual phrase you know is on that page  Use phrase queries where possible  Check your spelling!  Progressively use more terms  If you don't find what you want, use another Search Engine!

11 Who operates Search Engines?  People who can get money from venture capitalists!  Many search engines originate from U.S. universities  Often paid for by advertisements  Engines monitor carefully what else interests you (paid by the click)

12 How do pages get into a Search Engine?  Robot discovery (following links)  Self submission  Payments

13 Robot Discovery  Robots visit sites while following links  The more links the more visits  Make sure you don't exclude Robots from visiting public pages
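One way to check that public pages are not accidentally closed to robots is to test a robots.txt file with Python's standard urllib.robotparser module. The slide does not name robots.txt explicitly, so treating it as the exclusion mechanism is an assumption; the rules, the robot name ExampleBot and the URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks only the /private/ directory.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is a hypothetical robot name.
public_ok = parser.can_fetch("ExampleBot", "http://example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://example.org/private/notes.html")
print(public_ok, private_ok)
```

If public_ok came back False for a page meant to be found, the Disallow rules would be too broad.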

14 Payments  Some search engines only index paying customers  The more you pay the higher you appear on answers to queries

15 Self submission  Register your page with a search engine  Pay for a company to register you with many search engines  Get registration with many search engines for free!

16 Getting to the top  Only relevant pages should be ranked highly  Search engines only look at text  Search engine operators try to stop "search engine spamming"  Some queries are pre-answered

17 Get where you should be!  Put more than graphics on a page  Don't use frames  Use the tag  Make good use of and  Consider using the tag  Get people to link to your page

18 Summary  Search Engines are vital to the Web user  Search Engines are not perfect by a long way  There are tactics for better searching  Page design can bring more visitors via Search Engines  The more links the better!

19 WWLib-TNG A Next Generation Search Engine

20 In the beginning  WWLib-TOS –Manually constructed directory –Classified on Dewey Decimal –Simple data structure –Proof of concept

21 The New Architecture

22 The Classifier

23 Motive - Why Generate Metadata Automatically?  Meta tags are not compulsory  Old pages are less likely to have meta tags  Available data can be unreliable  The Web of Trust requires comprehensive resource description  An essential prerequisite for widespread deployment of RDF applications

24 Method - How can Metadata be Generated Automatically?  Using an automatic classifier  The classifier classifies Web Pages according to Dewey Decimal Classification  Other useful metadata can be extracted during the process of automatic classification

25 Automatic Classification  Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines  DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

26 Automatic Classifier - How does it work? Firstly, the page is retrieved from a URL or local file and parsed to produce a document object

27 Automatic Classifier - How does it work? The document object is then compared with DDC objects representing the ten top-level DDC classes

28 Automatic Classifier - How does it work?  Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score  A measure of similarity is then calculated using a similarity coefficient

29 Automatic Classifier - How does it work?  If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class  If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark  If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
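The comparison and descent described on slides 27 to 29 can be sketched as a recursive walk over a class hierarchy. This is not the WWLib-TNG implementation: the slides do not name the similarity coefficient, so the Dice-style normalisation, the 0.3 threshold, the word weights and the names DDCNode, similarity and classify are all assumptions, and only a tiny fragment of the DDC hierarchy is modelled.

```python
from dataclasses import dataclass, field

@dataclass
class DDCNode:
    classmark: str
    weights: dict  # word -> weight for this class (made-up values)
    children: list = field(default_factory=list)

def similarity(doc_weights, node):
    """Add both weights for every shared word, then normalise.

    The normalisation (matched weight over total weight) is an assumption;
    the slides only say that a similarity coefficient is used.
    """
    shared = doc_weights.keys() & node.weights.keys()
    score = sum(doc_weights[w] + node.weights[w] for w in shared)
    total = sum(doc_weights.values()) + sum(node.weights.values())
    return score / total if total else 0.0

def classify(doc_weights, nodes, threshold=0.3):
    """Return the classmarks of leaf classes reached via significant matches."""
    classmarks = []
    for node in nodes:
        if similarity(doc_weights, node) < threshold:
            continue  # not significant: go no further down this branch
        if node.children:
            classmarks.extend(classify(doc_weights, node.children, threshold))
        else:
            classmarks.append(node.classmark)  # leaf node: assign the classmark
    return classmarks

# Illustrative fragment of the hierarchy with invented word weights.
computing = DDCNode("004", {"computer": 2, "data": 1})
libraries = DDCNode("020", {"library": 2, "catalogue": 1})
generalities = DDCNode(
    "000",
    {"computer": 1, "library": 1, "information": 2},
    children=[computing, libraries],
)
science = DDCNode("500", {"physics": 2, "chemistry": 2})

doc = {"computer": 3, "data": 2, "information": 1}
result = classify(doc, [generalities, science])
print(result)
```

Pruning a branch as soon as its score falls below the threshold means most of the hierarchy is never compared against the document.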

30 Metadata elements  The automatic classification process can also be used to extract useful metadata elements besides the classification classmarks:  Keywords  Classmarks  Word count  Title  URL  Abstract  A unique accession number and associated dates can be obtained and supplied by the system
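As a rough illustration of how a few of these elements (title, word count, keywords) might be pulled from a page while it is being parsed, here is a sketch using Python's standard html.parser module. The keyword rule (most frequent words longer than three letters) is an assumption, not the method used by the classifier, and the names PageText and extract_metadata are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class PageText(HTMLParser):
    """Collect the <title> text and the remaining text of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.text.append(data)

def extract_metadata(url, html, top_n=3):
    """Build a small metadata record for one page."""
    parser = PageText()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.text).lower())
    # Assumed keyword rule: the most frequent words longer than three letters.
    counts = Counter(w for w in words if len(w) > 3)
    return {
        "url": url,
        "title": parser.title.strip(),
        "word_count": len(words),
        "keywords": [w for w, _ in counts.most_common(top_n)],
    }

record = extract_metadata(
    "http://example.org/",
    "<html><head><title>Search Engines</title></head>"
    "<body><p>Robots collect pages. Robots follow links.</p></body></html>",
)
print(record)
```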

31 Metadata elements - Wolverhampton Core
   Wolverhampton Core          Dublin Core
   1 Unique Accession number   Identifier
   2 Title
   3 URL                       Identifier
   4 Abstract                  Description
   5 Keywords                  Subject
   6 Classmarks                Subject
   7 Word count
   8 Classification date
   9 Last modified date        Date

32 RDF Data Model

33 RDF Schema  There is a significant overlap with the Dublin Core element set  Requirement for implementation clarity  Those that have Dublin Core equivalents are declared as sub-properties  Maintain interoperability with Dublin Core applications

34 RDF Schema Keyword Classmark

35 Classifier Evaluation  Automatic metadata generation will become important for the widespread deployment of RDF based applications  Documents created before the invention of RDF generating authoring tools also need to be described  RDF utilised in this manner may encourage interoperability between search engines  More info: http://www.scit.wlv.ac.uk/~ex1253/

36 Current Status of WWLib-TNG  New results interface proposed –R-wheel (CirSA)  Builder and searcher constructed, now being tested  Classifier constructed  Test Dispatcher/Analyser/Archiver in place

