
1 By Morris Wright, Ryan Caplet, Bryan Chapman

2 Overview
- Crawler-based search engine (a script/bot that searches the web in a methodical, automated manner) (Wikipedia, "web crawler")
- Limited to a subset of UConn's School of Engineering websites
- Resources: web server and MySQL servers provided by ECS
- Languages used: HTML, PHP, SQL, Perl

3 Task Breakdown
- Bryan: design the crawler; analyze files and fill the database with URLs to search
- Morris: search functionality; database/account management
- Ryan: UI development
- Ranking algorithm and keyword extraction done by the group

4 Crawler Summary
- The crawler creates a "mirror" of our intended scope of websites on the local hard drive
- Using a script, the title is then extracted from the relevant files and placed into a DB table
- Another script then visits each URL and extracts keywords to populate the second DB table
- When a user types a word into the search engine, the word is looked up in the keyword database, and a second query then displays all the URLs/titles matching that keyword
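The title-extraction step above can be sketched as follows. This is only an illustration in Python using the standard-library HTML parser; the team's actual script is not shown in the slides, and the names `TitleParser` and `extract_title` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

A page with no title element yields an empty string, which the caller could store as a missing title.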

5 Crawler - Wget
- The Linux command wget is used in our script, along with the base domain www.engr.uconn.edu/, to limit our crawler to sites within the School of Engineering
- "Wget can follow links in HTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site" (linux.about.com)
- Our "mirror"
- A script is then run recursively over the mirrored files to remove all the tags, preparing them for storage in the database
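A minimal sketch of the tag-removal pass over the mirror, written in Python as a stand-in for the team's script. It assumes the mirror is a directory tree of .html/.htm files; the function names, the file-extension filter, and the choice to skip script and style content are all assumptions, not details from the slides.

```python
import os
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps page text and drops tags plus <script>/<style> contents."""

    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string, whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def strip_mirror(root):
    """Yield (path, plain_text) for every .html/.htm file under root."""
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith((".html", ".htm")):
                path = os.path.join(folder, name)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    yield path, strip_tags(fh.read())
```

Walking the tree with `os.walk` covers the same ground as a recursive call without the recursion having to be written by hand.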

6 Crawler – Stem Words
- A script removes common filler words (often called stop words), such as: the, if, however
- It also combines like words by stripping suffixes such as -ion, -ing, -ier, etc., so that "Running" is treated the same as "Run"
- Helps save space in the database
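The filtering described above might look roughly like this in Python. The word list and suffix list contain only the examples named on the slide; the minimum stem length and the doubled-consonant rule are assumptions added so that "running" reduces to "run". This is a deliberately crude sketch, not a real stemming algorithm such as Porter's.

```python
# Only the examples from the slide; a real list would be much longer.
STOP_WORDS = {"the", "if", "however"}
SUFFIXES = ("ion", "ing", "ier")


def normalize(word):
    """Lower-case a word and strip one known suffix, if present."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Assumed rule: keep at least three letters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Assumed rule: collapse a doubled final consonant ("runn" -> "run").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def clean_words(words):
    """Drop stop words and normalize the rest."""
    return [normalize(w) for w in words if w.lower() not in STOP_WORDS]
```

For example, `clean_words(["The", "running", "if", "Run"])` returns `["run", "run"]`, so both forms would count toward the same keyword.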

7 Crawler Functionality
- Once this is accomplished, our first database is populated with indexing information and has the layout shown below.

Site Index Table
ID      Used as a primary key
URL     Stores site's URL address
TITLE   Stores extracted title
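As an illustration of the layout above, here is a sketch using Python's built-in sqlite3 module in place of the MySQL server the project used. The table name `site_index`, the column types, and the UNIQUE constraint on the URL are assumptions; the slides give only the three column names and their roles.

```python
import sqlite3


def create_site_index(conn):
    """Create the index table: ID (primary key), URL, TITLE."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS site_index (
            id    INTEGER PRIMARY KEY,
            url   TEXT NOT NULL UNIQUE,  -- UNIQUE is an assumption
            title TEXT
        )
        """
    )


def add_site(conn, url, title):
    """Insert a page if it is not already indexed and return its ID."""
    conn.execute(
        "INSERT OR IGNORE INTO site_index (url, title) VALUES (?, ?)",
        (url, title),
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM site_index WHERE url = ?", (url,)
    ).fetchone()
    return row[0]
```

Calling `add_site` twice with the same URL returns the same ID, which keeps a re-run of the crawler from duplicating rows.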

8 Crawler Functionality
- PHP is then used to loop through all the URL listings in our indexing database to create keywords
- Unwanted HTML syntax is removed and PHP's built-in function array_count_values is used to create a list of keywords and their frequencies
- For the time being, these keyword frequencies will be used to determine page rank and ordering on the search page
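The counting step can be illustrated in Python, where `collections.Counter` plays the role that `array_count_values` plays in PHP. The tokenization rule (runs of ASCII letters, lower-cased) is an assumption; the slides do not say how words are split.

```python
import re
from collections import Counter


def keyword_frequencies(text):
    """Map each word in already tag-stripped text to its occurrence count."""
    # Assumed tokenization: lower-case runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)
```

`keyword_frequencies(page_text).most_common()` then gives the keywords in descending order of frequency, the same ordering the search page is meant to use.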

9 MySQL Database

10 Crawler Functionality
- Once this list is created for a given website, we then populate our keyword database by either creating a new table for the keyword, or simply adding a new entry into an existing table

'Keyword' Table
ID      Used as a primary key
URL     Stores site's URL address
Freq    Stores keyword frequency
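A sketch of the "create the table or add to it" step, again with sqlite3 standing in for MySQL. The `kw_` table-name prefix, the column types, and the letters-only check are assumptions. The check matters because a table name cannot be passed as a bound parameter, so it must be validated before being placed in the SQL text.

```python
import re
import sqlite3


def keyword_table(keyword):
    """Return a safe table name for a keyword (hypothetical 'kw_' prefix)."""
    if not re.fullmatch(r"[a-z]+", keyword):
        raise ValueError("keyword must contain only lower-case letters")
    return "kw_" + keyword


def record_keyword(conn, keyword, url, freq):
    """Create the keyword's table if needed, then add one URL/frequency row."""
    table = keyword_table(keyword)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" ('
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, freq INTEGER NOT NULL)"
    )
    # The values themselves are bound as parameters, not interpolated.
    conn.execute(
        f'INSERT INTO "{table}" (url, freq) VALUES (?, ?)', (url, freq)
    )
    conn.commit()
```

A single keyword table with a keyword column would avoid dynamic table names altogether; the per-keyword layout is kept here only because it is the one the slide describes.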

11 Sample Keyword Results
- Consider the following results
- URL: http://www.uconn.edu/resnet
  Title: For all your Technology Needs
  Keyword: technology 4
  Keyword: information 10
- URL: http://www.uconn.edu/sports
  Title: For all your Sports Information
  Keyword: football 10
  Keyword: information 12

12 Crawler Functionality
- Once the databases have been populated, they just need to be integrated with the page's search function and the UI to be fully functional
- The current UI is fine for displaying a few results, but we will need something more efficient and better looking when there are hundreds of results

13 Search Function
- When a word is entered into the search bar, a query for that word is sent to the database
- If the word is in the database, the query will pull up all the URLs and their associated titles and display them on the page
- The pages should be ordered by their page rank: the higher the frequency of the keyword, the higher the rank
- The search function code is written in PHP and the queries are written in SQL
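The lookup and ordering described above can be sketched as below. This is Python with sqlite3 rather than the project's PHP and MySQL, and the table names (`site_index`, `kw_<word>`), the join on URL, and the sample rows (taken from the Sample Keyword Results slide) are assumptions made for illustration.

```python
import re
import sqlite3


def search(conn, word):
    """Return (url, title, freq) rows for a keyword, highest frequency first."""
    word = word.strip().lower()
    if not re.fullmatch(r"[a-z]+", word):
        return []
    table = "kw_" + word
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists is None:
        return []  # the word is not in the keyword database
    return conn.execute(
        f'SELECT s.url, s.title, k.freq FROM "{table}" AS k '
        "JOIN site_index AS s ON s.url = k.url ORDER BY k.freq DESC"
    ).fetchall()


# Small in-memory demo using the sample results from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE site_index (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
    CREATE TABLE kw_information (id INTEGER PRIMARY KEY, url TEXT, freq INTEGER);
    INSERT INTO site_index (url, title) VALUES
      ('http://www.uconn.edu/resnet', 'For all your Technology Needs'),
      ('http://www.uconn.edu/sports', 'For all your Sports Information');
    INSERT INTO kw_information (url, freq) VALUES
      ('http://www.uconn.edu/resnet', 10),
      ('http://www.uconn.edu/sports', 12);
    """
)
for url, title, freq in search(conn, "information"):
    print(freq, url, title)
```

With the sample data, the sports page (frequency 12) is listed ahead of the resnet page (frequency 10), and a word with no keyword table returns an empty list.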

14 Search Function Test

15 Search Function Example

16 Search Function - Mail

17 Search Function - Uconn

18 Search Function – N/a

19 Changes needed for Integration
- Need to set up the test database fields to match the criteria of the crawler database
- The test database uses only 1 database, whereas the crawler database uses 2: one for the URLs/titles, one for the keywords
- Need to work on security measures such as input validation and testing with Hackbar
- Hackbar is a tool used for testing SQL injections, XSS holes and site security
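Two of the usual defenses against the issues named above (SQL injection and XSS) can be sketched in Python. The slides do not say which measures the team adopted, so this is only an illustration: bound query parameters keep user input out of the SQL text, and HTML-escaping keeps it from being interpreted as markup when echoed back. The function names and the `site_index` table are hypothetical.

```python
import html
import sqlite3


def titles_for_url(conn, url):
    """Look up titles with a bound parameter instead of string concatenation."""
    rows = conn.execute(
        "SELECT title FROM site_index WHERE url = ?", (url,)
    ).fetchall()
    return [row[0] for row in rows]


def render_query_echo(user_input):
    """Echo the search term back to the page with HTML metacharacters escaped."""
    return "<p>Results for: " + html.escape(user_input) + "</p>"
```

With the bound parameter, an input such as `' OR '1'='1` is compared as a literal string and matches nothing, rather than rewriting the WHERE clause.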

20 Questions?

