Crawler-Based Search Engine Milestone IV By Ryan Caplet, Morris Wright and Bryan Chapman.

1 Crawler-Based Search Engine Milestone IV By Ryan Caplet, Morris Wright and Bryan Chapman

2 Topic Breakdown: Updated Task Breakdown; Parts of the Search Engine that are within the System; Diagram; Testing and Integration

3 Task Breakdown. Bryan: Crawler, Keyword Generator. Morris: Database and Server Administrator, Search Function. Ryan: Part of Crawler, Search Function, User Interface. All: Testing System Components.

4 Topic Breakdown: Updated Task Breakdown; Parts of the Search Engine that are within the System; Diagram; Testing and Integration

5 Breakdown of System Components: Recursive wget; Crawler / Indexer; Keyword Generator; Search Page

6 Recursive wget: Run recursively over the UConn network. More than 2,800 web pages were downloaded into the www folder, about 3 GB in size.

7 The Crawler (new_strip.pl): Written in the Perl programming language. Extracts the title and URL of each page and stores them in the Page Index database. Uses the File::Basename library to derive a title when none is found.
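The title-or-filename fallback described on this slide can be sketched as follows. This is a minimal illustration in Python rather than the Perl of new_strip.pl, which the slides do not show; the function name `page_title` and the regular expression are assumptions made for the example.

```python
import os
import re

# Hypothetical pattern; the real script's parsing approach is not shown in the slides.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(path, html):
    """Return the page's <title> text, or the file's base name if there is none."""
    match = TITLE_RE.search(html)
    if match:
        # Collapse runs of whitespace and newlines inside the title.
        title = " ".join(match.group(1).split())
        if title:
            return title
    # Fallback, analogous to what File::Basename provides in Perl.
    return os.path.basename(path)
```

For a downloaded file such as `www/about.html` with no `<title>` element, this sketch would store `about.html` as the title.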

8 Keyword Generator: Uses the index built by the Crawler. A stemming algorithm is applied to the words. PHP is used to stem the words, while Perl is used to interact with the Keywords database. Filenames: process2.php, fileopen.php, stemming.php and processKeyword.pl

9 Side Topic: Stemming Algorithm. Stemming is the process of reducing a word to its root (stem) form. Example: “stemmer”, “stemming” and “stemmed” are all based on “stem”, so the algorithm returns “stem” for each of those variations.
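As a rough illustration of the idea, here is a toy suffix-stripping stemmer. It is not the algorithm in stemming.php, which the slides do not describe, and it is far cruder than a real stemmer such as Porter's; the function name `toy_stem` and the three suffixes are assumptions chosen only to reproduce the slide's example.

```python
def toy_stem(word):
    """Strip one common suffix and undo consonant doubling (toy example only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "er"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            root = word[: -len(suffix)]
            # Undo consonant doubling: "stemm" becomes "stem".
            if len(root) >= 2 and root[-1] == root[-2] and root[-1] not in "aeiou":
                root = root[:-1]
            return root
    return word


for variant in ("stemmer", "stemming", "stemmed", "stem"):
    print(variant, "->", toy_stem(variant))
```

A rule this simple mishandles many English words (for example, it would shorten "falling" to "fal"), which is why production systems use a full stemming algorithm.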

10 Keyword Generator Cont’d: The Keyword Generator will produce thousands of tables, one for each word. Each table will contain the URLs where that word appears and the word's frequency at each URL. Uses an md5 checksum. This is what we will be searching from!
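The per-word tables might look like the sketch below, which uses in-memory dictionaries in place of database tables. The slides do not say how the md5 checksum is used; naming each table by the hash of its word is an assumption made here, as are the `kw_` prefix, the function names and the example URLs. Stemming is omitted to keep the sketch short.

```python
import hashlib
import re
from collections import Counter, defaultdict


def table_name(word):
    # Assumption: one table per word, named by the MD5 hex digest of the word.
    return "kw_" + hashlib.md5(word.encode("utf-8")).hexdigest()


def build_keyword_tables(pages):
    """Map each word's table name to {url: frequency of the word at that url}.

    `pages` is a dict of url -> page text (a stand-in for the Page Index).
    """
    tables = defaultdict(dict)
    for url, text in pages.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for word, freq in counts.items():
            tables[table_name(word)][url] = freq
    return dict(tables)


# Hypothetical pages, for illustration only.
pages = {
    "http://example.edu/a": "Stem stem cell",
    "http://example.edu/b": "stem",
}
tables = build_keyword_tables(pages)
print(tables[table_name("stem")])
```

Under this reading, a search for a word only needs to hash the word and read one table, and a word the keyword script never saw has no table to read.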

11 Search Page: Written in HTML and PHP. Filenames: index.html and results.php. Accesses the database and searches the tables for the words specified. Uses the Quicksort algorithm to sort results by frequency. Uses an md5 checksum so that it searches only what was generated by the keyword script.
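Sorting results by frequency with Quicksort can be sketched as below. This is a simple, non-in-place variant in Python, not the code in results.php; the `(url, frequency)` row shape and the descending order are assumptions about how results would be ranked.

```python
def quicksort_by_frequency(rows):
    """Sort (url, frequency) rows from highest to lowest frequency."""
    if len(rows) <= 1:
        return list(rows)
    pivot = rows[len(rows) // 2][1]
    # Partition around the pivot frequency; ties keep their original order.
    higher = [r for r in rows if r[1] > pivot]
    equal = [r for r in rows if r[1] == pivot]
    lower = [r for r in rows if r[1] < pivot]
    return quicksort_by_frequency(higher) + equal + quicksort_by_frequency(lower)


# Hypothetical result rows for one search term.
rows = [("a.html", 2), ("b.html", 5), ("c.html", 1), ("d.html", 5)]
ranked = quicksort_by_frequency(rows)
print(ranked)
```

The pages where the word appears most often come first, which matches the slide's "sort results by frequency".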

12 Topic Breakdown: Updated Task Breakdown; Parts of the Search Engine that are within the System; Diagram; Testing and Integration

13 Diagram

14 Topic Breakdown: Updated Task Breakdown; Parts of the Search Engine that are within the System; Diagram; Testing and Integration

15 Testing Entry Criteria: Each component must work adequately for its creator. Once the first party sees that it works, it is then verified by a second party.

16 Integration Strategy Points: All parts of the system are relatively separate, yet the later parts depend on the earlier parts' output. Integration is done as shown in the diagram.

17 Exit Criteria: For this system to be ready for beta testing, (1) the search page must be tested thoroughly to make sure that it functions correctly, with security concerns addressed as they come up; and (2) the keyword tables must build properly and be accessible from the search page.

18 The End. Any Questions, Concerns or Criticisms?

