
1 INFO 344 Web Tools And Development CK Wang University of Washington Spring 2014

2 Programming Assignment #3

3 Web Crawler, Offline processing, Dashboard

4 Web Crawler
– Google – crawl websites & index them for the search engine
– Amazon – crawl the web to price match against Amazon’s prices
– Aggregate content – shopping (Nextag), news (finance.google.com)

5 Offline Processing & Dashboard
Offline/async processing:
– Facebook Lookback
– Twitter firehose, analyzing sentiment
– YouTube video compression (upload, then compress)
– Anything that takes more than ~5 s to load => do it offline!
Dashboard – an easy way to see the status of offline processing
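The offline/async idea is easy to sketch with an Azure Storage queue: the web-facing code just drops a message on a queue and returns immediately, and a worker picks the job up later. A minimal sketch, assuming the classic 2014-era storage SDK; the queue name "workitems" and the development-storage connection string are placeholders, not part of the assignment:

// Web side: enqueue the slow job instead of running it inline (the "> 5 s" rule).
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public static class OfflineWork
{
    // "UseDevelopmentStorage=true" targets the local storage emulator while debugging.
    private static readonly CloudQueue Queue =
        CloudStorageAccount.Parse("UseDevelopmentStorage=true")
                           .CreateCloudQueueClient()
                           .GetQueueReference("workitems");   // hypothetical queue name

    public static void Enqueue(string jobDescription)
    {
        Queue.CreateIfNotExists();
        Queue.AddMessage(new CloudQueueMessage(jobDescription));
        // The web request returns right away; a worker role dequeues and processes
        // the job offline, and the dashboard reports its progress.
    }
}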

6 Final Product
Azure Cloud Service with a Web Role and a Worker Role.
Web Role
– dashboard.aspx: status, # of URLs, last 10 crawled, etc.
– admin.asmx: ClearIndex (also stops current crawling), StartCrawling, GetPageTitle
Worker Role
– Read a URL from the Queue
– Crawl the website
– Store the title to the Table
– Add URLs found to the Queue
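One way to read the worker-role half of this diagram, sketched against the classic Azure Storage SDK. The queue/table names, the title regex, and the link-extraction step are illustrative assumptions, not requirements from the slide:

using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.Table;

public class PageEntity : TableEntity
{
    public string Title { get; set; }
    public PageEntity() { }
    public PageEntity(string url, string title)
    {
        PartitionKey = "pages";
        RowKey = Uri.EscapeDataString(url);   // raw URLs contain characters illegal in a RowKey
        Title = title;
    }
}

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("urlstocrawl");
        CloudTable table = account.CreateCloudTableClient().GetTableReference("crawledpages");
        queue.CreateIfNotExists();
        table.CreateIfNotExists();

        while (true)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg == null) { Thread.Sleep(1000); continue; }   // queue empty, back off

            string url = msg.AsString;
            try
            {
                string html;
                using (var wc = new WebClient()) { html = wc.DownloadString(url); }

                // Store the <title> in the table (InsertOrReplace tolerates re-crawls).
                Match t = Regex.Match(html, @"<title>(.*?)</title>",
                                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
                table.Execute(TableOperation.InsertOrReplace(
                    new PageEntity(url, t.Groups[1].Value.Trim())));

                // Add links found on the page back onto the queue.
                foreach (Match m in Regex.Matches(html, @"href=""(http[^""]+)"""))
                    queue.AddMessage(new CloudQueueMessage(m.Groups[1].Value));
            }
            catch (Exception) { /* skip pages that fail to download or parse */ }

            queue.DeleteMessage(msg);
        }
    }
}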

7 Great User Experience
– Refreshing the dashboard gets me new data.
– The ASMX admin methods should return a relevant status such as “Index Cleared” instead of void/empty string; consider the other cases too (see the sketch below).
– Remove duplicates.
– Only crawl websites in the same domain as your seed URL.
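A minimal ASMX sketch for the status-string point. The method names come from slide 6; the bodies and the clearing/starting logic are placeholders:

using System.Web.Services;

[WebService(Namespace = "http://tempuri.org/")]
public class Admin : WebService
{
    [WebMethod]
    public string ClearIndex()
    {
        // ... stop the current crawl and delete the table/queue contents ...
        return "Index Cleared";                 // not void: the caller sees what happened
    }

    [WebMethod]
    public string StartCrawling(string seedUrl)
    {
        if (string.IsNullOrEmpty(seedUrl))
            return "Error: no seed URL given";  // consider the failure cases too
        // ... enqueue the seed URL's robots.txt / sitemaps ...
        return "Crawling started for " + seedUrl;
    }
}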

8 Start Now! (ok… after PA2)

9 Deliverables
Due on May 19, 11 pm PST. Submit on Canvas. Please submit the following as a single zip file:
– URL to your Azure instance hosting the dashboard (readme.txt); make sure crawling is complete!
– URL to your GitHub repo (share your GitHub with me & the TA) in readme.txt
– Visual Studio 2013 project & source code
– Screenshot of your Azure dashboard with the instance running (azure-compute.jpg)
– Write-up explaining how you implemented everything; make sure to address each of the requirements (writeup.txt, ~500 words)
– Extra credit – a short paragraph in extracredits.txt for each extra credit (how to see/trigger/evaluate/run your extra credit feature and how you implemented it)

10 Hints
– Respect robots.txt (google it, it’s a simple format).
– You only need to crawl pages in the same domain.
– Keep a list of already-visited URLs and don’t re-crawl them; store it in a fast-lookup data structure.
– Think about where to store stats.
– Your code should handle 2+ worker threads; think about concurrency when updating dashboard stats (see the sketch after this list).
– For local hosting/debugging, run Visual Studio as Administrator.
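For the fast-lookup and concurrency hints, one thread-safe sketch; the class and member names here are illustrative, not prescribed by the assignment:

using System.Collections.Concurrent;
using System.Threading;

public static class CrawlerState
{
    // Visited-URL set: ConcurrentDictionary gives O(1) lookups that are safe
    // to call from 2+ worker threads at once (the byte value is unused).
    private static readonly ConcurrentDictionary<string, byte> Visited =
        new ConcurrentDictionary<string, byte>();

    private static long _crawled;   // dashboard stat updated by many threads

    // Returns true only the first time a URL is seen, so callers can skip re-crawls.
    public static bool MarkVisited(string url)
    {
        return Visited.TryAdd(url.ToLowerInvariant(), 0);
    }

    public static void CountCrawled()
    {
        Interlocked.Increment(ref _crawled);   // atomic update, no lock needed
    }

    public static long CrawledSoFar
    {
        get { return Interlocked.Read(ref _crawled); }
    }
}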

11 Sitemaps
Start with these two robots.txt files and their sitemaps: http://www.cnn.com/robots.txt and http://sportsillustrated.cnn.com/robots.txt
For the CNN.com sitemap, ignore URLs more than 2 months old; for the Sports Illustrated sitemap, ignore non-NBA-related URLs.
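A sketch of how that filtering might look. The two-month cutoff reads the sitemap <lastmod> element and the NBA filter is a simple substring check; both are assumptions about how you choose to interpret the rule, not part of the slide:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class SitemapReader
{
    // robots.txt lists sitemaps as lines of the form "Sitemap: http://...".
    public static IEnumerable<string> SitemapsFromRobots(string robotsUrl)
    {
        string robots;
        using (var wc = new WebClient()) { robots = wc.DownloadString(robotsUrl); }
        return robots.Split('\n')
                     .Where(l => l.Trim().StartsWith("Sitemap:", StringComparison.OrdinalIgnoreCase))
                     .Select(l => l.Split(new[] { ':' }, 2)[1].Trim());
    }

    // Returns <loc> URLs from a sitemap, applying the assignment's two filters.
    public static IEnumerable<string> UrlsFromSitemap(string sitemapUrl, bool nbaOnly)
    {
        XDocument doc = XDocument.Load(sitemapUrl);
        XNamespace ns = doc.Root.Name.Namespace;
        foreach (XElement entry in doc.Root.Elements())
        {
            string loc = (string)entry.Element(ns + "loc");
            if (loc == null) continue;

            // CNN rule: skip entries whose <lastmod> is more than ~2 months old.
            DateTime lastMod;
            XElement lm = entry.Element(ns + "lastmod");
            if (lm != null && DateTime.TryParse(lm.Value, out lastMod)
                           && lastMod < DateTime.UtcNow.AddMonths(-2))
                continue;

            // Sports Illustrated rule: keep only NBA-related URLs.
            if (nbaOnly && loc.IndexOf("nba", StringComparison.OrdinalIgnoreCase) < 0)
                continue;

            yield return loc;
        }
    }
}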

12 Extra Credit
[10 pts] Multi-threaded crawler
[10 pts] Crawl & index HTML body text (remove HTML tags; see the sketch below)*
[10 pts] Graphical dashboard (shows stats over time)
[5 pts] Crawl more root domains (IMDb, Forbes, BBC, ESPN, Wikipedia; 1 pt per domain)
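For the body-text extra credit, a common quick approach is a regex-based sketch like the one below; a real HTML parser (e.g. HtmlAgilityPack) would be more robust, and this is only one possible way to do it:

using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Strip scripts/styles, then all remaining tags, then decode entities like &amp;.
    public static string BodyText(string html)
    {
        string noScripts = Regex.Replace(html, @"<(script|style)[\s\S]*?</\1>", " ",
                                         RegexOptions.IgnoreCase);
        string noTags = Regex.Replace(noScripts, @"<[^>]+>", " ");
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();   // collapse whitespace
    }
}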

13 Questions?
