Crawl RSS Kristinn Sigurðsson National and University Library of Iceland IIPC GA 2014 – Paris.


1 Crawl RSS
Kristinn Sigurðsson
National and University Library of Iceland
IIPC GA 2014 – Paris

2 The problem
Certain sites change very frequently
  ▫ News sites especially
While we can capture all the stories by visiting once per day, week, month or even year, they may have been modified several times in between, and the front page changes will be missed

3 RSS feed advantages
A change to the feed is highly likely to signify that an actual change has occurred
A single RSS feed reports changes both to the presumed “front page” and to article or item pages
RSS feeds are generally smaller (in bytes) than the front page (just the HTML) of a site
  ▫ Crawling the RSS feed frequently is more likely to be tolerated

4 How it works 1/4
On first load all feed elements are loaded
  ▫ A feed element is uniquely identified by its
      URL
      Timestamp
Each element plus the front page is visited
  ▫ Embeds are downloaded
  ▫ No further links are followed
  ▫ Strict controls need to be in place to halt scope leakage
      Each feed element should lead to a small, bounded number of URLs to crawl
      Basically, just get the minimal embeds, do not follow links
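The identification of feed elements by URL plus timestamp can be illustrated with a minimal Python sketch. It is not the add-on's code (the add-on runs inside Heritrix); the sample feed, the `example.org` URLs and the function name `parse_feed_items` are all assumptions made for illustration, and it assumes an RSS 2.0 feed whose items carry `<link>` and `<pubDate>`.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, not taken from the presentation.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://example.org/</link>
    <item>
      <title>Story A</title>
      <link>http://example.org/a</link>
      <pubDate>Mon, 12 May 2014 14:24:34 GMT</pubDate>
    </item>
    <item>
      <title>Story B</title>
      <link>http://example.org/b</link>
      <pubDate>Mon, 12 May 2014 14:11:50 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed_items(rss_text):
    """Return one (url, timestamp) key per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        if link is None or pub_date is None:
            # Without both parts there is no usable identity; skip the item.
            continue
        timestamp = parsedate_to_datetime(pub_date.strip())
        items.append((link.strip(), timestamp))
    return items


for url, timestamp in parse_feed_items(SAMPLE_RSS):
    print(url, timestamp.isoformat())
```

Because the key is the pair rather than the URL alone, a story republished under the same URL with a later timestamp counts as a different element.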

5 How it works 2/4
Once all the URLs generated by the initial feed elements have been crawled, the RSS feed may be revisited
  ▫ IF the minimum wait between visits has elapsed
  ▫ ELSE wait until the minimum time has elapsed
The second visit will (probably) show many already seen elements
  ▫ Identified by URL + timestamp
  ▫ If the feed is entirely unchanged, then the content hash will likely be unchanged
  ▫ If a URL has a new timestamp, it is probable that the content of the item has changed
  ▫ Only load items that have a timestamp that is more recent than the ‘most recently seen’ timestamp for each feed
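The "only load items newer than the most recently seen timestamp" rule can be sketched as follows. This is an illustrative Python sketch under assumptions, not the add-on's implementation: `select_new_items` is a hypothetical name, and items are assumed to be `(url, timestamp)` pairs with timezone-aware timestamps.

```python
from datetime import datetime, timezone


def select_new_items(items, most_recent_seen):
    """Keep items strictly newer than the feed's 'most recently seen' timestamp.

    items: iterable of (url, timestamp) pairs.
    most_recent_seen: datetime, or None on the very first visit.
    Returns (new_items, updated_most_recent_seen).
    """
    new_items = [
        (url, ts)
        for url, ts in items
        if most_recent_seen is None or ts > most_recent_seen
    ]
    # Advance the marker only if something newer was actually found.
    updated = max((ts for _, ts in new_items), default=most_recent_seen)
    return new_items, updated


# Hypothetical data for illustration.
seen_marker = datetime(2014, 5, 12, 14, 11, 50, tzinfo=timezone.utc)
feed = [
    ("http://example.org/a", datetime(2014, 5, 12, 14, 24, 34, tzinfo=timezone.utc)),
    ("http://example.org/b", datetime(2014, 5, 12, 14, 11, 50, tzinfo=timezone.utc)),
]
new_items, seen_marker = select_new_items(feed, seen_marker)
print([url for url, _ in new_items], seen_marker.isoformat())
```

A single marker per feed keeps the state small, at the cost of missing an item that is added later with an older timestamp than the marker.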

6 How it works 3/4
If there are changed or new elements
  ▫ Fetch ‘front page’ URI and URIs of changed and new elements
      If they match existing content hashes, they may be discarded, otherwise written to (W)ARCs.
  ▫ Do not revisit embedded content that we have already crawled
      This massively reduces the amount of time it takes to complete each RSS visit
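The two savings on this slide, discarding unchanged payloads by content hash and skipping embeds that were already crawled, could look roughly like the sketch below. Everything in it is an assumption for illustration: the class `VisitState`, its method names, the use of SHA-1, and the choice to compare hashes per URL are not taken from the add-on.

```python
import hashlib


class VisitState:
    """Remembers what has already been stored and which embeds were fetched."""

    def __init__(self):
        self.content_hashes = {}  # url -> hex digest of the last stored payload
        self.seen_embeds = set()  # embed URLs fetched on any earlier visit

    def should_write(self, url, payload):
        """Return True if the payload differs from what was last stored for url."""
        digest = hashlib.sha1(payload).hexdigest()
        if self.content_hashes.get(url) == digest:
            return False  # identical content: may be discarded
        self.content_hashes[url] = digest
        return True  # new or changed: would be written to a (W)ARC

    def embeds_to_fetch(self, embed_urls):
        """Return only embeds not crawled before, and mark them as crawled."""
        fresh = []
        for url in embed_urls:
            if url not in self.seen_embeds:
                self.seen_embeds.add(url)
                fresh.append(url)
        return fresh


state = VisitState()
print(state.should_write("http://example.org/", b"<html>v1</html>"))
print(state.should_write("http://example.org/", b"<html>v1</html>"))
print(state.embeds_to_fetch(["http://example.org/site.css"]))
print(state.embeds_to_fetch(["http://example.org/site.css"]))
```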

7 How it works 4/4
Once visit 2 is over
  ▫ Check whether the minimum wait has elapsed,
  ▫ rinse,
  ▫ repeat

8 Sites
Many sites have multiple feeds
Sometimes items will appear in more than one feed at a time
It is therefore possible to have multiple related feeds for one site
Such feeds are always crawled jointly and duplicate items are discarded
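Crawling related feeds jointly and discarding duplicates amounts to merging the feeds' items by their identity. A minimal Python sketch, assuming each item is a hashable `(url, timestamp)` pair; the function name `merge_feeds` and the sample data are hypothetical.

```python
def merge_feeds(feeds):
    """Merge the items of a site's feeds, keeping each item only once.

    feeds: dict mapping feed URL -> list of (item_url, timestamp) pairs.
    Returns the de-duplicated items in first-seen order.
    """
    seen = set()
    merged = []
    for items in feeds.values():
        for key in items:
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical site with two feeds that share one story.
site_feeds = {
    "http://example.org/rss/news": [
        ("http://example.org/a", "2014-05-12T14:24:34Z"),
        ("http://example.org/b", "2014-05-12T14:11:50Z"),
    ],
    "http://example.org/rss/world": [
        ("http://example.org/b", "2014-05-12T14:11:50Z"),
    ],
}
print(merge_feeds(site_feeds))
```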

9 Example
RSS Site: ruv.is
State: HOLD_FOR_FEED_EMIT
Number of discovered items: 0
Minimum wait between emitting feeds (ms): 600000
Earliest next feed emission: Mon May 12 14:49:48 GMT 2014
URLs being crawled: 0
Feeds last emitted: Mon May 12 14:39:48 GMT 2014
Feeds:
  Feed: http://www.ruv.is/rss/frettir
    Most recent seen: Mon May 12 14:24:34 GMT 2014
    http://www.ruv.is/
  Feed: http://www.ruv.is/rss/erlent
    Most recent seen: Mon May 12 14:11:50 GMT 2014
    http://www.ruv.is/
    http://www.ruv.is/erlent
  Feed: http://www.ruv.is/rss/sport
    Most recent seen: Sun May 11 22:55:17 GMT 2014
    http://www.ruv.is/
    http://www.ruv.is/ithrottir
  Feed: http://www.ruv.is/rss/innlent
    Most recent seen: Mon May 12 14:24:34 GMT 2014
    http://www.ruv.is/
    http://www.ruv.is/innlent
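The timing values in this report are consistent with each other: the earliest next emission is the last emission plus the minimum wait of 600000 ms (10 minutes). A small Python sketch of that arithmetic, with hypothetical function names:

```python
from datetime import datetime, timedelta, timezone


def earliest_next_emission(last_emitted, min_wait_ms):
    """Earliest time at which the feeds may be emitted again."""
    return last_emitted + timedelta(milliseconds=min_wait_ms)


def may_emit(now, last_emitted, min_wait_ms):
    """True once the minimum wait since the last emission has elapsed."""
    return now >= earliest_next_emission(last_emitted, min_wait_ms)


# Values taken from the report above.
last_emitted = datetime(2014, 5, 12, 14, 39, 48, tzinfo=timezone.utc)
print(earliest_next_emission(last_emitted, 600000).isoformat())
# prints 2014-05-12T14:49:48+00:00
```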

10 Configuration
Either via Heritrix’s CXML
Or using the database interface
  ▫ Maintaining the DB is outside the scope of the add-on
Easy to add new configuration handlers

11 Crawl RSS - Heritrix 3 add-on
Available on GitHub:
  ▫ https://github.com/Landsbokasafn/crawlrss
Requires Heritrix 3.1.2 or newer
Stable, but still technically in ‘beta’
In use at NULI for almost a year now
  ▫ First news sites
  ▫ Now also select blogs and government sites


