By Morris Wright, Ryan Caplet, Bryan Chapman
Overview
- Crawler-based search engine: a script/bot that browses the web in a methodical, automated manner (Wikipedia, "web crawler")
- Limited to a subset of UConn School of Engineering websites
- Resources: web server and MySQL server provided by ECS
- Languages used: HTML, PHP, SQL, Perl
Task Breakdown
- Bryan: crawler design; analyze files and fill the database with URLs to search
- Morris: search functionality; database/account management
- Ryan: UI development
- Ranking algorithm and keyword extraction done by the group
Crawler Summary
- The crawler creates a "mirror" of our intended scope of websites on the local hard drive
- A script then extracts the title from each relevant file and places it into a DB table
- Another script visits each URL and extracts keywords to populate the second DB table
- When a user types a word into the search engine, that word is queried in the keyword database; from that word a second query displays all the URLs/titles matching that specific keyword
Crawler – Wget
- The Linux command wget is used in our script with the base domain www.engr.uconn.edu/ to limit the crawler to sites within the School of Engineering
- "Wget can follow links in HTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site" (linux.about.com) – this local copy is our "mirror"
- A script is then used to run a recursive pass that removes all the tags from the files, preparing them for storage in the database
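A minimal sketch of the two steps above. The slides do not show the actual wget flags or the tag-stripping script, so both are illustrative: the wget flags are typical mirroring options, and the tag stripper is a crude sed one-liner, not the project's code.

```shell
# Mirror the School of Engineering site into ./www.engr.uconn.edu/
# (illustrative flags; requires network access, so left commented out)
# wget --mirror --no-parent --domains engr.uconn.edu https://www.engr.uconn.edu/

# Tag-stripping pass: remove HTML tags from a mirrored file so only the
# text remains for database storage. Crude sketch; a real pass would also
# handle tags that span lines.
strip_tags() {
  sed -e 's/<[^>]*>//g' "$1"
}
```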
Crawler – Stem Words
- A script removes arbitrary "stem" (stop) words such as: the, if, however
- Like words are combined by stripping suffixes such as -ion, -ing, -ier, etc., so "Running" is treated the same as "Run"
- This helps save space in the database
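The stop-word removal and suffix stripping above can be sketched in shell. The stop-word list and suffix rules below are illustrative only (the slides name just a few); a real stemmer such as Porter's would also handle doubled consonants ("running" → "run" rather than "runn").

```shell
# Lowercase the input, split into one word per line, drop stop words,
# then strip a few common suffixes (crude illustrative stemming).
normalize() {
  tr 'A-Z' 'a-z' \
    | tr -s ' ' '\n' \
    | grep -v -w -e 'the' -e 'if' -e 'however' \
    | sed -e 's/ing$//' -e 's/ion$//' -e 's/ier$//'
}
```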
Crawler Functionality
Once this is accomplished, our first database is populated with indexing information and has the layout seen below.

Site Index Table
  ID    – used as a primary key
  URL   – stores the site's URL address
  TITLE – stores the extracted title
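A possible schema for the site index table described above; the slides give only the column roles, so the table name and column types here are assumptions.

```sql
-- Site index table: one row per mirrored page.
CREATE TABLE site_index (
    id    INT AUTO_INCREMENT PRIMARY KEY,  -- used as a primary key
    url   VARCHAR(255) NOT NULL,           -- site's URL address
    title VARCHAR(255)                     -- extracted title
);
```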
Crawler Functionality
- PHP is then used to loop through all the URL listings in our indexing database to create keywords
- Unwanted HTML syntax is removed, and PHP's built-in function array_count_values is used to create a list of keywords and their frequencies
- For the time being, these keyword frequencies will be used to determine page rank and ordering on the search page
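The deck does this in PHP with array_count_values; an equivalent word-frequency count can be sketched as a shell pipeline (illustrative, not the project's code):

```shell
# Count keyword frequencies in already tag-stripped text:
# lowercase, one word per line, drop empties, count duplicates,
# and list the most frequent words first.
word_freq() {
  tr 'A-Z' 'a-z' \
    | tr -cs 'a-z' '\n' \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -rn
}
```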
MySQL Database
Crawler Functionality
Once this list is created for a given website, we then populate our keyword database by either creating a new table for the keyword, or simply adding a new entry into an existing table.

'Keyword' Table
  ID   – used as a primary key
  URL  – stores the site's URL address
  Freq – stores the keyword frequency
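A possible shape for the per-keyword tables described above, using the "technology" keyword and URL from the sample results on the next slide; table name and column types are assumptions.

```sql
-- One table per keyword, e.g. for the keyword "technology".
CREATE TABLE IF NOT EXISTS keyword_technology (
    id   INT AUTO_INCREMENT PRIMARY KEY,  -- used as a primary key
    url  VARCHAR(255) NOT NULL,           -- site's URL address
    freq INT NOT NULL                     -- keyword frequency on that page
);

-- Add a new entry into an existing keyword table.
INSERT INTO keyword_technology (url, freq)
VALUES ('http://www.uconn.edu/resnet', 4);
```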
Sample Keyword Results
Consider the following results:

URL: http://www.uconn.edu/resnet
Title: For all your Technology Needs
  Keyword: technology – 4
  Keyword: information – 10

URL: http://www.uconn.edu/sports
Title: For all your Sports Information
  Keyword: football – 10
  Keyword: information – 12
Crawler Functionality
- Once the databases have been populated, the crawler just needs to be integrated with the page's search function and UI to be fully functional
- The current UI is good for displaying a few results, but we will need something more efficient and better looking when there are hundreds of results
Search Function
- When a word is entered into the search bar, that word is queried against the database
- If the word is in the database, the query pulls up all the URLs and their associated titles and displays them on the page
- The pages should be ordered by their page rank – the higher the frequency of the keyword, the higher the rank
- The search function code is written in PHP and the queries are written in SQL
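Under the per-keyword-table layout sketched earlier, the search query for one keyword might look like the following; both table names (keyword_technology, site_index) are assumptions, since the slides don't show the actual SQL.

```sql
-- Look up all pages for the keyword "technology", joining back to the
-- index table for titles, highest frequency (rank) first.
SELECT k.url, s.title, k.freq
FROM keyword_technology AS k
JOIN site_index AS s ON s.url = k.url
ORDER BY k.freq DESC;
```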
Search Function Test
Search Function Example
Search Function - Mail
Search Function – UConn
Search Function – N/A
Changes Needed for Integration
- Need to set up the test database fields to match the criteria of the crawler database
- The test setup uses only one database, whereas the crawler uses two – one for the URLs/titles, one for the keywords
- Need to work on security measures such as input validation, tested with Hackbar
- Hackbar is a tool used for testing SQL injections, XSS holes, and site security
Questions?