Agenda: Overview of the project; Resources. CS172 Project: crawling, ranking, indexing.

2 CS172 Project: crawling, ranking, indexing

3 Phase 1 Options
– Web data: you need to come up with your own crawling strategy.
– Twitter data: you can use a third-party library for the Twitter Streaming API, but you still need to do some web crawling.

4 Crawling
1. Download the contents of a page.
2. Parse the downloaded file to extract the links on the page.
3. Clean and normalize the extracted links.
4. Store the extracted links in the Frontier, which exposes getNext() and addAll(List).
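The crawling steps on this slide revolve around the Frontier. The following is a minimal sketch of one, assuming a first-in-first-out order and a "seen" set to suppress duplicates. Only the method names getNext() and addAll(List) come from the slide; the class name, the fields, and isEmpty() are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical frontier: a FIFO queue plus a "seen" set, so each URL is
// handed out at most once. Only getNext() and addAll(List) are named on
// the slide; everything else here is an assumption.
class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Stores cleaned links, silently skipping ones that were already added.
    void addAll(List<String> urls) {
        for (String url : urls) {
            if (seen.add(url)) {
                queue.add(url);
            }
        }
    }

    // Returns the next URL to crawl, or null when the frontier is empty.
    String getNext() {
        return queue.poll();
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A crawl loop would then repeatedly call getNext(), download and parse that page, clean and normalize the links it finds, and hand them back through addAll(List) until getNext() returns null.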

5 1. Download File Contents
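One possible way to download a page with only the Java standard library is sketched below. It assumes the page is served over plain http and encoded as UTF-8, and it uses 5-second timeouts; the class and method names are illustrative, not part of the project handout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names and timeouts are assumptions.
class PageDownloader {
    // Reads the whole stream into a String, assuming UTF-8 content.
    static String readAll(InputStream in) {
        try (InputStream input = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Downloads one page over http; returns null on a non-200 response.
    static String download(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(address).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return null;
            }
            return readAll(conn.getInputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string is the raw HTML source, tags included, which is the input to the parsing step.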

6 2. Parsing HTML to extract links
[Figure: source of a downloaded page] This is what you will see when you download a page. Notice the HTML tags.

7 2. Parsing HTML file
Write your own parser. Some suggestions: parse the HTML file as XML. Two parsing methods:
– SAX (Simple API for XML)
– DOM (Document Object Model)
Use an existing library:
– JSoup, which can also be used to download the page
– HTML Parser

8 2. Parsing HTML file
Things to think about:
– How do you handle malformed HTML? A browser can still display it, but how will your parser handle it?
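If you write your own parser, one simple approach that does not depend on well-formed markup is to scan for href attributes with a regular expression. The sketch below is only that: it assumes quoted attribute values inside `<a>` tags, misses unquoted values, and is not a substitute for a real HTML parser such as JSoup. The class name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based extractor; tolerant of unclosed tags because
// it never tries to build a document tree.
class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in document order, uncleaned.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim());
        }
        return links;
    }
}
```

The values returned are exactly what the page contains, so they still include relative paths and bookmarks that the cleaning step has to deal with.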

9 3. Clean extracted URLs
Some URL entries found while crawling:
– /intranet/
– /inventthefuture.html
– news/e-newsletter.html
– /faculty/
– /
– /about/
– #main
– riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104

10 3. Clean extracted URLs
What to avoid:
– Parse only http links (avoid ftp, https, or any other protocol).
– Avoid duplicates:
  – Bookmarks such as #main should be stripped off.
  – Self paths: /
– Avoid downloading PDFs or images, e.g. /news/GraphenePublicationsIndex.pdf. It is OK to download them, but you cannot parse them.
– Take care of invalid characters in URLs, such as spaces and ampersands. These characters should be encoded, or else you will get a MalformedURLException.
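The rules on this slide can be sketched as a single filter over absolute URLs. This is a minimal sketch under several assumptions: the list of skipped extensions is made up for illustration, only spaces are encoded (ampersand handling is left out), and duplicate detection is left to whatever stores the links. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical cleaning step for absolute URLs.
class UrlCleaner {
    // Extensions we do not try to parse; this list is an assumption.
    private static final List<String> SKIPPED = Arrays.asList(
            ".pdf", ".jpg", ".jpeg", ".png", ".gif");

    // Returns a cleaned URL, or null if the link should be skipped.
    static String clean(String url) {
        String u = url.trim();
        // Strip the bookmark: everything from '#' onward.
        int hash = u.indexOf('#');
        if (hash >= 0) {
            u = u.substring(0, hash);
        }
        // Keep only http links.
        if (!u.toLowerCase(Locale.ROOT).startsWith("http://")) {
            return null;
        }
        // Skip files we cannot parse, judging by the extension of the path.
        int q = u.indexOf('?');
        String path = (q >= 0 ? u.substring(0, q) : u).toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED) {
            if (path.endsWith(ext)) {
                return null;
            }
        }
        // Encode spaces so later URL handling does not choke on them.
        return u.replace(" ", "%20");
    }
}
```

Because bookmarks are stripped before the URL is stored, two links that differ only in their bookmark collapse into the same string, which makes duplicate detection a plain set lookup.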

11 Normalize Links Found on the page
Relative URLs have no host address. E.g., while crawling a page you find URLs such as:
– Case 1: /find_people.php. A “/” at the beginning means the path starts from the root of the host.
– Case 2: all. No “/” means the path is relative to the current path.
Normalize them (respectively) to the corresponding absolute URLs.
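Both cases can be handled by resolving the link against the URL of the page it was found on. The sketch below uses java.net.URI.resolve for that; the host www.example.com and the page path are placeholders, not the site from the slide, and the class name is illustrative. Note that URI.create rejects unencoded spaces, so links should be cleaned first.

```java
import java.net.URI;

// Hypothetical helper: resolves a link against the page it was found on.
class RelativeResolver {
    // 'base' is the absolute URL of the current page, 'link' the href value.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).normalize().toString();
    }
}
```

With a current page of http://www.example.com/people/index.php, Case 1 (/find_people.php) resolves against the root of the host and Case 2 (all) resolves against the /people/ directory.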

12 Clean extracted URLs
The different parts of a URL are Protocol, Host, Port, Path, Query, and Bookmark. Example: county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533

13 java.net.URL
Has methods that can separate the different parts of a URL:
– getProtocol: http
– getHost:
– getPort: -1
– getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
– getQuery: ssimg=532988
– getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
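The getters listed here can be tried out with a short sketch. It assumes the class in question is java.net.URL; the wrapper class and the use of getRef() for the bookmark are additions for illustration, and any URL passed in is your own test input.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical wrapper around the java.net.URL getters.
class UrlParts {
    // Returns the parts in a fixed order:
    // protocol, host, port, path, query, file, bookmark (ref).
    static String[] split(String address) {
        try {
            URL url = new URL(address);
            return new String[] {
                url.getProtocol(),
                url.getHost(),
                String.valueOf(url.getPort()), // -1 when no port is given
                url.getPath(),
                url.getQuery(),
                url.getFile(),                 // path plus "?query"
                url.getRef()                   // the part after '#'
            };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

A port of -1 simply means the URL did not name one, so the default port for the protocol applies.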

14 Normalizing with java.net.URL
You can normalize URLs with simple string manipulations and the methods of the java.net.URL class. “Case 1” root-relative URLs, for example, can be normalized this way.
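A minimal sketch of Case 1 normalization by string manipulation follows. It is not the original snippet from the slide; it assumes java.net.URL is the class meant, and the class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical Case 1 normalizer for links that start with "/".
class RootRelative {
    // A root-relative link is relative to the root of the host, so prepend
    // the protocol and host of the current page (and the port, if given).
    static String normalize(String pageAddress, String link) {
        try {
            URL page = new URL(pageAddress);
            String port = page.getPort() == -1 ? "" : ":" + page.getPort();
            return page.getProtocol() + "://" + page.getHost() + port + link;
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Case 2 links need the directory of the current page as well, which can be taken from getPath() up to and including its last "/".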

15 Crawler Ethics
Some websites don’t want crawlers swarming all over them. Why?
– Increased load on the server
– Private websites
– Dynamic websites
– …

16 Crawler Ethics
How does a website tell you (the crawler) whether anything is off limits, and what? Two options:
– Site-wide restrictions: robots.txt
– Webpage-specific restrictions: meta tag

17 Crawler Ethics: robots.txt
A file called “robots.txt” in the root directory of the website.
Format:
User-Agent:
Disallow:
Allow:

18 Crawler Ethics: robots.txt
What should you do? Before starting on a new website:
– Check whether robots.txt exists.
– If it does, download it and parse it for all inclusions and exclusions for the “generic crawler”, i.e. User-Agent: *
– Don’t crawl anything in the exclusion list, including its sub-directories.
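The check described on this slide could look roughly like the sketch below. It is deliberately simplified: it reads only the Disallow lines under User-Agent: *, ignores Allow lines and wildcard patterns, and assumes each group has a single User-Agent line. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt rules for the generic crawler.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    // Collects the Disallow prefixes listed under "User-Agent: *".
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inGenericGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inGenericGroup = value.equals("*");
            } else if (field.equals("disallow") && inGenericGroup && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    // A path is off limits if it starts with any disallowed prefix, which
    // also covers everything in its sub-directories.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Parsing once per host and caching the resulting rules avoids downloading robots.txt again for every page on the same site.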

19 Crawler Ethics: Webpage-specific meta tags
Some webpages carry a robots meta tag with one or more of the following options:
– INDEX or NOINDEX
– FOLLOW or NOFOLLOW
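A rough way to honor these options is to look for the robots meta tag in the downloaded HTML before indexing the page or following its links. The sketch below assumes the common form `<meta name="robots" content="...">` with the name attribute first and quoted values; pages without the tag are treated as INDEX, FOLLOW. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check for the robots meta tag of a single page.
class MetaRobots {
    // Finds <meta name="robots" content="..."> with the name attribute first.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the upper-cased content value, or "" when there is no tag.
    private static String content(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : "";
    }

    // Indexing is permitted unless the tag says NOINDEX.
    static boolean mayIndex(String html) {
        return !content(html).contains("NOINDEX");
    }

    // Links may be followed unless the tag says NOFOLLOW.
    static boolean mayFollow(String html) {
        return !content(html).contains("NOFOLLOW");
    }
}
```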

20 Twitter data collecting
Collecting through the Twitter Streaming API:
– See the API documentation, where you can check the data schema.
– Rate limit: you will get up to 1% of the whole Twitter traffic, so you can get about 4.3M tweets per day (about 2GB).
– You need to have a Twitter account for that.

21 Third-party library
Twitter4j for Java. You can find support for other languages as well. It is well documented and comes with code examples.

22 Important Fields
You should save at least the following fields:
– Text
– Timestamp
– Geolocation
– User of the tweet
– Links
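One way to hold these fields before writing them to disk is a small record class. The sketch below is an assumption about layout, not a required format: it stores one tweet per tab-separated line, and the class name, field names, and column order are all hypothetical.

```java
import java.util.List;

// Hypothetical container for the fields listed on the slide.
class TweetRecord {
    final String text;
    final long timestampMillis;
    final Double latitude;   // null when the tweet carries no geolocation
    final Double longitude;  // null when the tweet carries no geolocation
    final String user;
    final List<String> links;

    TweetRecord(String text, long timestampMillis, Double latitude,
                Double longitude, String user, List<String> links) {
        this.text = text;
        this.timestampMillis = timestampMillis;
        this.latitude = latitude;
        this.longitude = longitude;
        this.user = user;
        this.links = links;
    }

    // One tweet per line: timestamp, user, "lat,lon", links, text.
    // Tabs and line breaks inside the text are replaced by a space.
    String toLine() {
        String geo = (latitude == null || longitude == null)
                ? "" : latitude + "," + longitude;
        String cleanText = text.replaceAll("[\\t\\r\\n]+", " ");
        return String.join("\t", String.valueOf(timestampMillis), user, geo,
                String.join(" ", links), cleanText);
    }
}
```

Keeping the text in the last column means any stray separators in it cannot shift the other fields.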

23 Crawl links in Tweets
Tweets may contain links.
– They may point to useful information, e.g., news articles.
After collecting the tweets, use another process to crawl the links.
– Crawling is slower than collecting, so you may not want to crawl a link right after you get the tweet.
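The separation between collecting and crawling can be sketched with a queue of pending links. In the sketch below an in-memory queue stands in for whatever hand-off you actually use between the two processes (a file or a database table, for instance); the class and method names are hypothetical, and the link pattern is a rough assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-off between the tweet collector and the link crawler.
class TweetLinkQueue {
    // Rough pattern: anything that starts with http:// or https:// and
    // runs up to the next whitespace.
    private static final Pattern LINK = Pattern.compile("https?://\\S+");
    private final Queue<String> pending = new ArrayDeque<>();

    // Called by the collector: cheap, does no network work.
    void collect(String tweetText) {
        Matcher m = LINK.matcher(tweetText);
        while (m.find()) {
            pending.add(m.group());
        }
    }

    // Called later by the crawling process: hands out up to 'max' links.
    List<String> nextBatch(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !pending.isEmpty()) {
            batch.add(pending.poll());
        }
        return batch;
    }
}
```

Because collect() never touches the network, the collector can keep up with the stream while the slower crawl works through the backlog at its own pace.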

