
Australian web domain harvests 2005, 2006 & 2007.




1 Australian web domain harvests 2005, 2006 & 2007

2 Igor Ranitovic, Internet Archive engineer, with the Petabox rack for the Australian domain harvest

3 PANDORA: Domain Harvesting
Australian domain harvest
- .au domain, located on Australian servers
- Internet Archive
1st harvest, June/July 2005
- 4 weeks, 185m files, 6.69 TB
2nd harvest, Aug/Sept 2006
- 5 weeks, 596m files, 19.04 TB
3rd harvest, Aug/Sept 2007
- 4 weeks, 516m files, 18.47 TB

4 Comparative statistics
PANDORA: Files: 51 million, Size: 2.12 TB
Domain Harvests: Files: 1,297 million, Size: 44.2 TB

Domain Harvest    2005           2006           2007
Unique files      185,549,662    596,238,990    516,064,820
Hosts crawled     811,523        1,046,038      1,247,614
Size              6.69 TB        19.04 TB       18.47 TB
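As a quick check on the combined figures in the slide above, the per-year numbers can be summed directly. The following is a minimal sketch, not part of the original presentation: the dictionary layout and variable names are illustrative, and the average file size assumes decimal terabytes (1 TB = 10^12 bytes), which the slides do not specify.

```python
# Figures copied from the "Comparative statistics" slide.
# The data layout and names are illustrative, not from the presentation.
harvests = {
    2005: {"unique_files": 185_549_662, "hosts": 811_523, "size_tb": 6.69},
    2006: {"unique_files": 596_238_990, "hosts": 1_046_038, "size_tb": 19.04},
    2007: {"unique_files": 516_064_820, "hosts": 1_247_614, "size_tb": 18.47},
}

# Totals across the three domain harvests.
total_files = sum(h["unique_files"] for h in harvests.values())
total_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

print(f"Total unique files: {total_files:,}")
print(f"Total size: {total_tb} TB")

# Derived per-year ratios (assumption: 1 TB = 10**12 bytes).
for year, h in harvests.items():
    files_per_host = h["unique_files"] / h["hosts"]
    avg_file_kb = h["size_tb"] * 10**12 / h["unique_files"] / 1000
    print(f"{year}: {files_per_host:,.0f} files per host, "
          f"about {avg_file_kb:,.1f} kB per file")
```

The summed file count is 1,297,853,472 and the summed size is 44.2 TB, which is consistent with the slide's combined "1,297 million" files (apparently truncated rather than rounded) and 44.2 TB.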

5 PANDORA: Domain Harvesting

6 Some pros
- Retains linkages and context
- Large scale: more bytes for the buck
- Less selectively discriminating
Some cons
- High dependence on the crawler technology
- Domain and geo-location bias (.au, geoIP)
- Limitations in timeliness, quality assurance, scoping, site complexity, deep web
- Legal and access issues to resolve

7 PANDORA: Australia’s Web Archive
- Enormous growth and volume of material
- Everyone can be creators and publishers
- Virtually instantaneous publication
- Dynamic content and format
- Multiplicity of formats
- Technology dependent
- Hyperlinked and interconnected
- Highly accessible but hard to identify
- Ephemeral
- Interactivity, re-use, personalisation (web 2.0)

