CP3024 Lecture 12 Search Engines

What is the main WWW problem?  With an estimated 800 million web pages, finding the one you want is difficult!

What is a Search Engine?  A page on the web connected to a backend program  Allows a user to enter words which characterise a required page  Returns links to pages which match the query

A Typical Search Engine

Types of Search Engine  Automatic search engine, e.g. AltaVista, Lycos  Classified Directory, e.g. Yahoo!  Meta-Search Engine, e.g. Dogpile

Components of a Search Engine  Robot (or Worm or Spider) –collects pages –checks for page changes  Indexer –constructs a sophisticated file structure to enable fast page retrieval  Searcher –satisfies user queries
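
To make the three components concrete, here is a minimal sketch (my own illustration, not WWLib code) of a robot that collects pages, an indexer that builds an inverted index, and a searcher that answers one-word queries. The URLs and page texts are hypothetical.

```python
from collections import defaultdict

def robot(urls):
    """Collect pages. A real robot fetches over HTTP and follows links;
    here canned text stands in for the fetched pages."""
    fake_web = {
        "http://example.org/a": "search engines index the web",
        "http://example.org/b": "robots crawl the web and collect pages",
    }
    return {url: fake_web.get(url, "") for url in urls}

def indexer(pages):
    """Build an inverted index: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def searcher(index, word):
    """Satisfy a one-word query by looking the word up in the index."""
    return index.get(word.lower(), set())

pages = robot(["http://example.org/a", "http://example.org/b"])
index = indexer(pages)
print(searcher(index, "web"))   # both pages contain "web"
```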

Query Interface  Usually a Boolean interface –(Fred and Jean) or (Bill and Sam)  Normally allows phrase searches –"Fred Smith"  Also proximity searches  Not generally understood by users  May have extra 'friendlier' features
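
A hedged sketch of how the Boolean query on this slide can be satisfied with set operations over an inverted index; the index contents are made up for illustration. Phrase and proximity searches need word positions rather than plain sets, so they are not shown here.

```python
# word -> set of pages containing it (hypothetical data)
index = {
    "fred": {"page1", "page2"},
    "jean": {"page2"},
    "bill": {"page3"},
    "sam":  {"page3", "page4"},
}

# (Fred AND Jean) OR (Bill AND Sam): AND is set intersection, OR is union.
result = (index["fred"] & index["jean"]) | (index["bill"] & index["sam"])
print(result)   # {'page2', 'page3'}
```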

Search Results  Presented as links  Supposedly ordered in terms of relevance to the query  Some Search Engines score results  Normally organised in groups of ten per page

Problems  Links are often out of date  Usually too many links are returned  Returned links are not very relevant  The Engines don't know about enough pages  Different engines return different results  U.S. bias

Improving query results  To look for a particular page use an unusual phrase you know is on that page  Use phrase queries where possible  Check your spelling!  Progressively use more terms  If you don't find what you want, use another Search Engine!

Who operates Search Engines?  People who can get money from venture capitalists!  Many search engines originate from U.S. universities  Often paid for by advertisements  Engines monitor carefully what else interests you (paid by the click)

How do pages get into a Search Engine?  Robot discovery (following links)  Self submission  Payments

Robot Discovery  Robots visit sites while following links  The more links the more visits  Make sure you don't exclude Robots from visiting public pages
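
Exclusion is normally expressed through a site's robots.txt file. The sketch below, using Python's standard library and a placeholder URL, shows how a well-behaved robot checks whether it may fetch a page before visiting it.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.org/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse robots.txt

# A polite robot asks before fetching each page.
if rp.can_fetch("*", "https://example.org/public/index.html"):
    print("Robots may visit and index this page")
else:
    print("This page is excluded from robot visits")
```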

Payments  Some search engines only index paying customers  The more you pay the higher you appear on answers to queries

Self submission  Register your page with a search engine  Pay for a company to register you with many search engines  Get registration with many search engines for free!

Getting to the top  Only pages relevant to the query should be ranked highly  Search engines only look at text  Search engine operators try to stop "search engine spamming"  Some queries are pre-answered

Get where you should be!  Put more than graphics on a page  Don't use frames  Use the <title> tag  Make good use of <h1> and <h2>  Consider using the <meta> tag  Get people to link to your page
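
Since search engines only look at text, the elements listed above are what an indexer actually reads. A minimal sketch (my illustration, not part of any particular engine) of extracting the title, headings and meta keywords from a page with Python's standard HTML parser:

```python
from html.parser import HTMLParser

class PageTextExtractor(HTMLParser):
    """Collect the elements an indexer typically weights:
    title, h1/h2 headings, and meta keywords/description."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.title = ""
        self.headings = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self.current = tag
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") in ("keywords", "description"):
                self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current in ("h1", "h2"):
            self.headings.append(data.strip())

html = """<html><head><title>CP3024 Search Engines</title>
<meta name="keywords" content="search engines, indexing"></head>
<body><h1>Search Engines</h1><p>Lecture notes...</p></body></html>"""

parser = PageTextExtractor()
parser.feed(html)
print(parser.title, parser.headings, parser.meta)
```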

Summary  Search Engines are vital to the Web user  Search Engines are not perfect by a long way  There are tactics for better searching  Page design can bring more visitors via Search Engines  The more links the better!

WWLib-TNG A Next Generation Search Engine

In the beginning  WWLib-TOS –Manually constructed directory –Classified on Dewey Decimal –Simple data structure –Proof of concept

The New Architecture

The Classifier

Motive - Why Generate Metadata Automatically?  Meta tags are not compulsory  Old pages are less likely to have meta tags  Available data can be unreliable  The Web of Trust requires comprehensive resource description  An essential prerequisite for widespread deployment of RDF applications

Method - How can Metadata be Generated Automatically?  Using an automatic classifier  The classifier classifies Web Pages according to Dewey Decimal Classification  Other useful metadata can be extracted during the process of automatic classification

Automatic Classification  Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines  DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

Automatic Classifier - How does it work? Firstly, the page is retrieved from a URL or local file and parsed to produce a document object
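
A sketch, under my own assumptions about the data structure, of what parsing a page into a "document object" might look like: the page is reduced to a bag of words with weights. Plain term frequencies are used here; the slides do not say how the real classifier weights words.

```python
import re
from collections import Counter

def build_document_object(html_text):
    """Strip markup and reduce the page to word -> weight (term frequency)."""
    text = re.sub(r"<[^>]+>", " ", html_text)    # drop tags
    words = re.findall(r"[a-z]+", text.lower())  # tokenise
    return Counter(words)                        # word -> frequency

doc = build_document_object("<html><body>Search engines index the web. "
                            "The web is large.</body></html>")
print(doc.most_common(3))   # e.g. [('the', 2), ('web', 2), ('search', 1)]
```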

Automatic Classifier - How does it work? The document object is then compared with DDC objects representing the top ten DDC classes

Automatic Classifier - How does it work?  Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score  A measure of similarity is then calculated using a similarity coefficient

Automatic Classifier - How does it work?  If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class  If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark  If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy

Metadata elements  The automatic classification process can be used to extract useful metadata elements in addition to the classification classmarks:  Keywords  Classmarks  Word count  Title  URL  Abstract  A unique accession number and associated dates can be obtained and supplied by the system

Metadata elements - Wolverhampton Core

  Wolverhampton Core           Dublin Core equivalent
  1  Unique accession number   Identifier
  2  Title
  3  URL                       Identifier
  4  Abstract                  Description
  5  Keywords                  Subject
  6  Classmarks                Subject
  7  Word count
  8  Classification date
  9  Last modified date        Date
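
A sketch of a Wolverhampton Core record as a simple data structure, following the table above; the field names are my own renderings of the element names, not an official definition.

```python
from dataclasses import dataclass

@dataclass
class WolverhamptonCoreRecord:
    accession_number: str      # 1  -> Dublin Core: Identifier
    title: str                 # 2
    url: str                   # 3  -> Identifier
    abstract: str              # 4  -> Description
    keywords: list             # 5  -> Subject
    classmarks: list           # 6  -> Subject
    word_count: int            # 7  (no Dublin Core equivalent listed)
    classification_date: str   # 8  (no Dublin Core equivalent listed)
    last_modified_date: str    # 9  -> Date
```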

RDF Data Model

RDF Schema  There is a significant overlap with the Dublin Core element set  Requirement for implementation clarity  Those that have Dublin Core equivalents are declared as sub-properties  Maintain interoperability with Dublin Core applications

RDF Schema  Property declarations for Keyword and Classmark (the schema markup itself was not preserved in this transcript)
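
Since the original markup is missing, here is an illustrative sketch only (not the published WWLib schema; the namespace URI is a placeholder) of how Keyword and Classmark could be declared as sub-properties of dc:subject, using the third-party rdflib library.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import DC, RDF, RDFS

WLV = Namespace("http://example.org/wolverhampton-core#")   # hypothetical URI

g = Graph()
g.bind("dc", DC)
g.bind("wlv", WLV)

# Declare Keyword and Classmark as properties, each a sub-property of
# dc:subject, so Dublin Core applications can still interpret them.
for prop in (WLV.keyword, WLV.classmark):
    g.add((prop, RDF.type, RDF.Property))
    g.add((prop, RDFS.subPropertyOf, DC.subject))

print(g.serialize(format="turtle"))
```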

Classifier Evaluation  Automatic metadata generation will become important for the widespread deployment of RDF based applications  Documents created before the invention of RDF generating authoring tools also need to be described  RDF utilised in this manner may encourage interoperability between search engines  More info:

Current Status of WWLib-TNG  New results interface proposed –R-wheel (CirSA)  Builder and searcher constructed, now being tested  Classifier constructed  Test Dispatcher/Analyser/Archiver in place