Search Engine Basics (Mr. Shaw)


1 Search Engine Basics
Mr. Shaw

2 Search Engine Basics
The following is a simplified tutorial on search engine basics.
Not technically detailed or precise.
Intended for general students, not computer science majors.
Intended as an “electronic tutorial,” not as “presentation-ready” material.

3 How search engines work
The basics: “bots” and “indexing”
Computers using sophisticated software (“bots” or “spiders”) automatically seek out and “read” webpages.
For each webpage, an “index” is created (somewhat like the index in the back of a book, but more complete).
This process is described as “indexing the webpage.”

4 Indexing a book (conceptually)
Review page 3 of Friedman’s The World is Flat (R. 3.0).
Note that “IBM” appears on p. 3.
(So do “Columbus,” “Texas Instruments,” etc., so note these as well.)
Keep reviewing each page; eventually get to p. 59.
Note that “IBM” appears on that page as well.
  Page 3: “Columbus,” “IBM,” “Texas Instruments,” etc.
  Page 59: “Osama Bin Laden,” “Afghanistan,” “IBM,” etc.

5 Indexing a book (conceptually)
Continue the process until you get through the entire book.
Then you “flip” or reverse your index.
Readers don’t want to know “what’s on p. 3 and p. 59,” but, for example, “on what pages do I find ‘IBM,’ or ‘Columbus,’ or ‘Texas Instruments,’ etc.?”
While this is tedious work, human beings (or at least interns) can do it.
Most books are under 500 pages in length.

6 Indexing a book (conceptually)
When I am done going through the whole book, I then take each important term (e.g. “IBM”) and determine what pages this term is on.
This information is included in the index, found at the back of most books.
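The "flip" described in slides 4 to 6 can be sketched in a few lines of Python. This is only an illustration: the page numbers and terms come from the slides, while the names `forward_index`, `flip_index`, and `book_index` are made up for this example.

```python
from collections import defaultdict

# Forward index: what the indexer notes while reading page by page.
# Only the terms named on the slides are listed; a real index has many more.
forward_index = {
    3: ["Columbus", "IBM", "Texas Instruments"],
    59: ["Osama Bin Laden", "Afghanistan", "IBM"],
}


def flip_index(forward):
    """Turn 'page -> terms' into 'term -> sorted list of pages'."""
    inverted = defaultdict(set)
    for page, terms in forward.items():
        for term in terms:
            inverted[term].add(page)
    return {term: sorted(pages) for term, pages in inverted.items()}


book_index = flip_index(forward_index)
print(book_index["IBM"])       # [3, 59]
print(book_index["Columbus"])  # [3]
```

The flipped structure answers the reader's real question ("on what pages do I find 'IBM'?") with a single lookup.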

7 Indexing a website (conceptually)
Pages are reviewed by computers, not human beings, but the essential process is very similar.
Review http://www.ibm.com/software
  Note that “IBM” appears on this webpage.
Review http://www.ibm.com/websphere
  Note that “IBM” appears on this webpage.
Review www.cdw.com
  Note that “IBM” appears on this webpage.
Eventually, a high percentage of all webpages are “read” by computers; pages where “IBM” appears are identified.

8 Indexing a website (conceptually)
Computers can “read” webpages much, much faster than a human being can.
Computers collect much more data from a webpage than a human being can.
Not just “‘IBM’ was on this page,” but also …
  “IBM” appeared on this page immediately adjacent to the word “software.”
  “IBM” appeared within 5 words of the word “Microsoft.”

9 Indexing a website (conceptually)
After the computers have “read” every page they can find, the index is “flipped” much like a book index is.
Result is a “database” mapping specific terms and other information to webpages.
  “Term ‘IBM’ is found on following webpages …”
  “Term ‘IBM’ is within 3 words of term ‘software’ on following webpages …”
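One way to record "other information" such as word proximity is to store the position of every word on every page. The sketch below assumes that approach; the page text is invented for illustration (only the URLs appear in the slides), and "within N words" is taken to mean that the two word positions differ by at most N. Real search engines store far more than this.

```python
import re
from collections import defaultdict

# Hypothetical page text; the URLs are from the slides, the wording is not.
pages = {
    "http://www.ibm.com/software": "IBM software helps teams build and run applications",
    "www.cdw.com": "Shop hardware from IBM and Microsoft plus software from many vendors",
}


def build_positional_index(pages):
    """Map term -> {url: [positions of the term on that page]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][url].append(position)
    return index


def within(index, term_a, term_b, max_distance):
    """Return URLs where the two terms occur at most max_distance words apart."""
    a_pages = index.get(term_a.lower(), {})
    b_pages = index.get(term_b.lower(), {})
    hits = []
    for url in a_pages:
        if url in b_pages and any(
            abs(a - b) <= max_distance
            for a in a_pages[url]
            for b in b_pages[url]
        ):
            hits.append(url)
    return sorted(hits)


index = build_positional_index(pages)
print(within(index, "IBM", "software", 3))   # ['http://www.ibm.com/software']
print(within(index, "IBM", "Microsoft", 5))  # ['www.cdw.com']
```

On the hypothetical www.cdw.com text, "IBM" and "software" are four words apart, so that page is excluded by the 3-word limit.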

10 Indexing
Term “indexing” is used frequently; it refers to a computer “reading” specified content and building an index.
Indexing can convert a mass of largely useless “stuff” into a very useful resource.
Example: “I would like to index all the transcripts of the CBS Evening News since its inception.”
  Now I can identify every broadcast where “Watergate” was mentioned since 1972.
Example: “Google’s Desktop Search tool can index the thousands of files sitting on my hard drive.”
  Now I can find that article I wrote 10 years ago.

11 How search engines work
In some cases, user-generated metadata (“data about data”) is also utilized.
  E.g. “keywords” and “description” fields, which are easily added to a webpage when it is created.
  Example of a description: “This webpage describes how search engines work.”
Metadata can be extremely useful, but is also misused to manipulate search engines.
  Example: Scammers add common search terms (“Britney Spears,” etc.) to their metadata, even if the webpages have nothing to do with Britney Spears.
Value of metadata may be greatest in controlled environments, e.g. intranets, where webpage creators can be trusted not to include misleading metadata.
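On a webpage, the “keywords” and “description” fields are typically written as `<meta>` tags in the page's HTML. As a rough sketch of how a bot might read them, the example below uses Python's standard `html.parser` module. The sample HTML and the class name `MetaTagParser` are made up for illustration; only the description sentence is taken from the slide.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of the 'keywords' and 'description' meta tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.metadata[name] = attributes.get("content") or ""


# Hypothetical page source, written only for this example.
sample_html = """
<html><head>
<meta name="description" content="This webpage describes how search engines work.">
<meta name="keywords" content="search engines, indexing, bots">
</head><body>...</body></html>
"""

parser = MetaTagParser()
parser.feed(sample_html)
keywords = [k.strip() for k in parser.metadata["keywords"].split(",")]
print(parser.metadata["description"])
print(keywords)  # ['search engines', 'indexing', 'bots']
```

Nothing in this code checks whether the keywords match the page's actual content, which is exactly why the fields are easy to misuse.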

12 How search engines work
When a user submits a query, the query is matched to the previously created index.
Most basic approach is to just look for similarity between the index and the search terms (keywords) contained in the query.
  Common in early days of search.
  Often fails to provide useful, relevant search results.
Modern search engines use “secret sauce” to improve results.
  “Secret sauce” = sophisticated algorithms.
  Google’s “secret sauce” is known as PageRank.
  One trick: knowing when to ignore user-generated metadata.
Search engine optimization vs. search engine manipulation
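The “most basic approach” can be illustrated with a small sketch that scores each page by how many of the query's keywords it contains. This is not PageRank or any real engine's algorithm. The index contents and the function name `search` are hypothetical, loosely based on the pages mentioned in slide 7.

```python
# Hypothetical, hand-built index: term -> set of pages containing that term.
index = {
    "ibm": {"ibm.com/software", "ibm.com/websphere", "cdw.com"},
    "software": {"ibm.com/software", "cdw.com"},
    "websphere": {"ibm.com/websphere"},
}


def search(index, query):
    """Rank pages by the number of query keywords they contain, most first."""
    scores = {}
    for keyword in query.lower().split():
        for page in index.get(keyword, set()):
            scores[page] = scores.get(page, 0) + 1
    # Sort by descending score, then alphabetically to keep the order stable.
    return sorted(scores, key=lambda page: (-scores[page], page))


print(search(index, "IBM software"))
# ['cdw.com', 'ibm.com/software', 'ibm.com/websphere']
```

The two pages that match both keywords tie, and nothing here says which one is more useful to the reader. That gap is what the “secret sauce” algorithms are meant to fill.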

