1 Web Search Algorithms By Matt Richard and Kyle Krueger

2 What is it? A search engine algorithm is the set of rules, or unique formula, that a search engine uses to judge the significance of a web page; each engine has its own set of rules. These rules decide whether a page is genuine or just spam, whether it contains information people would actually want, and many other features used to rank and order the results for every search query, so that the results page is organized and informative. Because the algorithms differ from engine to engine, they are closely guarded secrets, but there are certain things that all search engine algorithms have in common.

3 Basic Principles A web search engine must be able to do three major things: 1) Crawl 2) Index 3) Rank Each is a separate process with its own algorithms and methods.

4 Crawling “Crawling” is the process by which a web page is fetched and parsed to determine its contents. It begins with a list of “seeds,” or starting points; as each page is visited, every hyperlink within it is added to the queue to be crawled in turn. This basic scheme was later improved with blacklists and lists of previously visited pages.
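
A minimal sketch of that crawl loop, assuming a hypothetical fetch_links(url) helper that downloads a page and returns the hyperlinks it contains:

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl: start from seed URLs and queue every new link found."""
    frontier = deque(seeds)      # URLs waiting to be crawled
    visited = set(seeds)         # the "previously visited" list
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        for link in fetch_links(url):   # parse the page and extract its hyperlinks
            if link not in visited:     # skip pages already queued or crawled
                visited.add(link)
                frontier.append(link)
    return visited
```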

5 Crawling The current approach uses a priority queue so that websites that are visited or updated more frequently are re-crawled sooner. Compare, for example, the crawl priority of a site that displays the current time anywhere in the world with that of a static company archive.
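
A hedged sketch of that idea using Python's heapq; the URLs and update-frequency estimates are made up for illustration:

```python
import heapq
import time

def schedule(urls, change_interval):
    """Build a priority queue of (next_visit_time, url); the page estimated to
    change soonest is crawled first. change_interval is an assumed per-URL
    estimate, in seconds, of how often the page is updated."""
    now = time.time()
    heap = [(now + change_interval[url], url) for url in urls]
    heapq.heapify(heap)
    return heap

# A page showing the current time is revisited every minute; a static
# company archive only once a month (both URLs are hypothetical).
heap = schedule(
    ["worldclock.example/now", "corp.example/archive"],
    {"worldclock.example/now": 60, "corp.example/archive": 30 * 24 * 3600},
)
next_visit, url = heapq.heappop(heap)   # -> worldclock.example/now
```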

6 Crawling Blacklists are another more recent feature. These lists contain URLs that either do not redirect where they claim to or are malicious in nature, and they are meant to guard against hacking attempts such as denial-of-service attacks and malware implantation. Crawlers can skip blacklisted sites entirely, and a page that links to something on the list may have its own crawl priority lowered.

7 Indexing Indexing allows documents to be found quickly for a given search term. Search engines use an inverted file so that indexing terms can be looked up quickly. Building an inverted file has two main phases: scanning and inversion.

8 Scanning In the scanning phase, the indexer scans each input document's text and writes a posting for each indexable term it finds. Each posting contains a term number and a document number. Because documents are processed one after another, the postings naturally come out in document-number order.

9 Inversion During the inversion phase, the indexer sorts the postings into term-number order, with document number as the secondary sort key. As it does so, it records the starting point and length of the postings list for every term.
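
A small sketch covering both phases, assuming each document is just a string of whitespace-separated terms:

```python
def scan(docs):
    """Scanning phase: emit a posting (term_number, doc_number) for each
    indexable term; postings come out in document-number order."""
    term_ids = {}              # term -> term number, assigned on first sight
    postings = []
    for doc_num, text in enumerate(docs):
        for word in text.lower().split():
            term_num = term_ids.setdefault(word, len(term_ids))
            postings.append((term_num, doc_num))
    return term_ids, postings

def invert(postings):
    """Inversion phase: sort postings by term number (document number as the
    secondary key), then record the start and length of each term's list."""
    postings.sort()                    # tuples sort by term, then doc number
    directory = {}                     # term_number -> [start, length]
    for i, (term_num, _) in enumerate(postings):
        if term_num not in directory:
            directory[term_num] = [i, 0]
        directory[term_num][1] += 1
    return postings, directory
```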

10 Example
Input Strings
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
Inverted Index
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
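
The same example, built with a plain dictionary that maps each term to the set of document numbers containing it:

```python
docs = ["it is what it is", "what is it", "it is a banana"]

index = {}
for doc_num, text in enumerate(docs):
    for term in text.split():
        index.setdefault(term, set()).add(doc_num)

for term in sorted(index):
    print(term, index[term])
# a {2}
# banana {2}
# is {0, 1, 2}
# it {0, 1, 2}
# what {0, 1}
```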

11 Real Indexers Real indexers store additional information in the postings, such as term positions and frequencies. Efficient indexers scan documents until they run out of available memory, then write a partial inverted file to disk, clear the memory they were using, and repeat the process. Indexers also compress their data, which reduces disk space and memory demands and makes both indexing and query processing faster.
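
A rough sketch of that spill-to-disk idea; the memory budget is approximated here by a maximum number of postings per run, and merging the runs back into one inverted file is left out:

```python
import json
import tempfile

def flush(postings):
    """Sort an in-memory run of postings and write it to a temporary file."""
    postings.sort()
    run = tempfile.NamedTemporaryFile("w", suffix=".run", delete=False)
    json.dump(postings, run)
    run.close()
    return run.name

def index_in_runs(docs, max_postings=1_000_000):
    """Accumulate (term, doc_number, position) postings until the budget is
    reached, spill the partial inverted file to disk, clear memory, repeat."""
    runs, postings = [], []
    for doc_num, text in enumerate(docs):
        for position, term in enumerate(text.lower().split()):
            postings.append((term, doc_num, position))  # positions kept too
            if len(postings) >= max_postings:
                runs.append(flush(postings))
                postings = []
    if postings:
        runs.append(flush(postings))
    return runs
```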

12 [image-only slide]

13 Result Quality Simple query processors often return poor results. Result quality can be drastically improved if every result is scored by a function that takes into account document length, inlink score, anchor-text matches, occurrences of the query terms, phrase matches, and so on.
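
A sketch of such a scoring function; the fields of the result dictionary and the weights are assumptions made up for illustration, not taken from any real engine:

```python
def score(result, query_terms):
    """Combine several relevance signals into a single score."""
    # term-frequency signal, dampened by document length
    tf = sum(result["term_counts"].get(t, 0) for t in query_terms) / result["length"]
    s = 1.0 * tf
    s += 2.0 * result["inlink_score"]                    # link-based authority
    s += 1.5 * result["anchor_matches"]                  # query terms in anchor text
    s += 3.0 * (1.0 if result["phrase_match"] else 0.0)  # exact phrase found in document
    return s
```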

14 PageRank PageRank, developed by Sergey Brin and Lawrence Page, was the first algorithm used by Google to rank webpages. PageRank assigns a value to every page based on the number of pages that link to it and the quality of the links.

15 [image-only slide]

16 PageRank A page's rank is computed from the ranks of the pages that link to it, each divided by that page's own number of outgoing links; the standard formulation is sketched below.
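
The formulation Brin and Page published is PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where the Ti are the pages linking to A, C(Ti) is the number of links going out of Ti, and d is a damping factor usually set around 0.85. A minimal iterative sketch of that computation:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over every
    page T that links to A. `links` maps each page to the pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # every page starts with rank 1
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / len(links[t])
                                    for t in pages if page in links[t])
            for page in pages
        }
    return pr

# Tiny made-up link graph: a -> b, c; b -> c; c -> a
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```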

17 PageRank Link-analysis ranking in the spirit of PageRank remains common among the well-known web search engines, including Yahoo!, Google, and Bing.

18 Google The most well known web search engine is Google. They were the first to develop a set of algorithms that consistently returned relevant pages for a search, due in part to their initial focus on the second and third functions of a good web search algorithm: indexing and ranking. Google began in 1997 as a project by Larry Page and Sergey Brin and was incorporated in September 1998.

19 More Specifics Google continually evaluates and refines its search algorithms, and the process works so well that they are able to update twice daily.

20 Questions?

