Building Scalable Web Archives Florent Carpentier, Leïla Medjkoune Internet Memory Foundation IIPC GA, Paris, May 2014.


1 Building Scalable Web Archives Florent Carpentier, Leïla Medjkoune Internet Memory Foundation IIPC GA, Paris, May 2014

2 Internet Memory Foundation
Internet Memory Foundation (European Archive)
- Established in 2004 in Amsterdam and then in Paris
- Mission: Preserve Web content by building a shared web archiving (WA) platform
- Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
- Open Access Collections: UK National Archives & Parliament, PRONI, CERN, The National Library of Ireland, etc.
Internet Memory Research
- Spin-off of IMF established in June 2011 in Paris
- Mission: Operate large-scale or selective crawls and develop new technologies (processing and extraction)

3 Internet Memory Foundation
Focused crawling:
- Automated crawls through the Archivethe.Net shared platform
- Quality-focused crawls: video capture (YouTube channels), Twitter crawls, complex crawls
Large-scale crawling:
- In-house developed distributed software
- Scalable crawler: MemoryBot
- Also designed for focused crawls and complex scoping

4 Research projects
Web Archiving and Preservation
- Living Web Archives (2007-2010)
- Archives to Community MEMories (2010-2013)
- SCAlable Preservation Environment (2010-2013)
Web-scale data Archiving and Extraction
- Living Knowledge (2009-2012)
- Longitudinal Analytics of Web Archive data (2010-2013)

5 MemoryBot design (1)
- Started in 2010 with the support of the LAWA (Longitudinal Analytics of Web Archive data) project
- URL store designed for large-scale crawls (DRUM)
- Built in Erlang, a language for distributed, fault-tolerant systems
- Distributed (consistent hashing)
- Robust: adapts to topology changes, regulates memory usage, isolates processes
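
The slide names consistent hashing as the distribution mechanism but gives no detail. The following is a minimal Python sketch of the general technique, not MemoryBot's Erlang implementation. The class name `HashRing`, the virtual-node count, the choice of SHA-1 and the node names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 64 bits of its SHA-1)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        """Drop every point owned by this node; other points are untouched."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, host):
        """Return the node responsible for a host name."""
        if not self._points:
            raise LookupError("empty ring")
        idx = bisect.bisect(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[idx]]
```

When a node leaves the ring, only the hosts it owned are reassigned. Every other host keeps its crawler, which is the property that makes topology changes cheap.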

6 MemoryBot design (2)

7 MemoryBot performance
- Good throughput with a slow decrease
- 85 resources written per second, slowing to 55 after 4 weeks, on a cluster of nine 8-core servers (32 GiB of RAM)
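
To put these rates in perspective, the short Python calculation below converts them to daily volumes: about 7.3 million resources per day at 85/s and about 4.8 million at 55/s. The four-week total of roughly 169 million assumes the rate declines linearly between the two figures. That is an assumption for illustration, since the slide does not describe the shape of the curve.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def resources_per_day(rate_per_second):
    """Daily volume at a constant write rate."""
    return rate_per_second * SECONDS_PER_DAY


def total_linear_decline(start_rate, end_rate, days):
    """Total volume if the rate falls linearly (assumption, not from the slides)."""
    mean_rate = (start_rate + end_rate) / 2
    return mean_rate * days * SECONDS_PER_DAY


print(resources_per_day(85))                # start of the crawl
print(resources_per_day(55))                # after 4 weeks
print(total_linear_decline(85, 55, 28))     # rough 4-week total
```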

8 MemoryBot counters

9

10 MemoryBot – quality
- Support for HTTPS, retries on server failure, configurable URL canonicalisation
- Scope: domain suffixes, language, hop sequences, white lists, black lists
- Priorities
- Trap detection (URL pattern identification, duplicate detection within a PLD)
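
As an illustration of two of the features listed above, here is a minimal Python sketch of URL canonicalisation and domain-based scoping. The specific rules (lower-casing the host, dropping default ports and fragments) and the precedence of black list over white list over suffixes are assumptions. The slides do not specify MemoryBot's actual rules, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(url):
    """Lower-case scheme and host, drop default port and fragment, ensure a path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def in_scope(url, suffixes=(), white_list=(), black_list=()):
    """Decide whether a URL belongs to the crawl (assumed precedence:
    black list first, then white list, then domain suffixes)."""
    host = (urlsplit(url).hostname or "").lower()

    def matches(domains):
        return any(host == d or host.endswith("." + d) for d in domains)

    if matches(black_list):
        return False
    if matches(white_list):
        return True
    return matches(suffixes)
```

For example, `canonicalise("HTTP://Example.ORG:80/a?b=1#frag")` and `canonicalise("http://example.org/a?b=1")` yield the same string, so the URL store sees them as one resource.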

11 MemoryBot – multi-crawl
- Easier management
- Politeness observed across different crawls
- Better resource utilisation
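
One way to observe politeness across crawls is to keep a single per-host timetable that all crawls consult before fetching. The Python sketch below shows that idea only. The class name `PolitenessGate`, the fixed 2-second delay and the explicit `now` argument are assumptions, not details taken from MemoryBot.

```python
class PolitenessGate:
    """Tracks, per host, the earliest time the next fetch may start.

    A single instance is shared by every crawl, so two crawls targeting
    the same host cannot exceed the delay between them.
    """

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self._next_allowed = {}  # host -> earliest permitted fetch time

    def try_acquire(self, host, now):
        """Return True and reserve the host if a fetch may start at `now`."""
        if now < self._next_allowed.get(host, 0.0):
            return False
        self._next_allowed[host] = now + self.delay
        return True


gate = PolitenessGate(delay_seconds=2.0)
crawl_a_ok = gate.try_acquire("example.org", now=0.0)  # crawl A fetches
crawl_b_ok = gate.try_acquire("example.org", now=1.0)  # crawl B must wait
```

Passing the clock in as `now` keeps the sketch deterministic. A real scheduler would more likely read a monotonic clock and requeue the URL when the gate refuses.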

12 IM Infrastructure
Green datacenters
- Through a collaboration with NoRack
- Designed for massive storage (petabytes of data)
- Highly scalable, low consumption
- Reduces storage and processing costs
Repository
- HDFS (Hadoop Distributed File System): a distributed, fault-tolerant file system
- HBase: a distributed key-value index (temporal archives)
- MapReduce: a distributed execution framework
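
The "temporal archives" role of the key-value index can be pictured as a store in which each URL maps to several timestamped versions. The Python sketch below is an in-memory stand-in for that idea. It does not use the HBase API, and the class and method names are hypothetical.

```python
import bisect


class VersionedStore:
    """In-memory model of a temporal key-value index (illustrative only)."""

    def __init__(self):
        self._times = {}   # url -> sorted list of capture timestamps
        self._values = {}  # (url, timestamp) -> captured content

    def put(self, url, timestamp, content):
        """Store one capture of `url` taken at `timestamp`."""
        if (url, timestamp) not in self._values:
            bisect.insort(self._times.setdefault(url, []), timestamp)
        self._values[(url, timestamp)] = content

    def get(self, url, at=None):
        """Return the latest capture at or before `at` (newest if `at` is None)."""
        times = self._times.get(url, [])
        if not times:
            return None
        if at is None:
            return self._values[(url, times[-1])]
        idx = bisect.bisect_right(times, at)
        if idx == 0:
            return None  # no capture that early
        return self._values[(url, times[idx - 1])]


store = VersionedStore()
store.put("http://example.org/", 20120101, "first capture")
store.put("http://example.org/", 20130101, "second capture")
```

A lookup such as `store.get("http://example.org/", at=20121231)` returns the 2012 capture, which is the access pattern a replay tool needs when a user browses the archive as of a given date.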

13 IM Platform (1)
- Data storage: temporal aspect (versions)
- Organised data: fast and easy access to content
- Easy processing distribution (Big Data)
- Several views on the same data: raw, extracted and/or analysed
- Takes care of data replication: no (W)ARC synchronisation required

14 IM Platform (2)
Extensive characterisation and data mining actions:
- Process and reprocess information at any time depending on needs/requests
- Extract information such as MIME type, text resources, image metadata, etc.
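
A characterisation job of this kind fits the map/reduce pattern naturally. The Python sketch below counts archived resources per MIME type in a single process, to show the shape of such a job. It is not Hadoop code, and the record layout (a dict with a `content_type` field) is an assumption.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (mime_type, 1) for one archived resource."""
    raw = record.get("content_type", "unknown")
    mime = raw.split(";")[0].strip().lower()  # drop parameters such as charset
    yield (mime or "unknown", 1)


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one MIME type."""
    return key, sum(values)


def run(records):
    """Group mapped pairs by key, then reduce each group (the shuffle phase
    is simulated here with a dictionary)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())
```

Because the map step looks at one record at a time and the reduce step at one key at a time, the same job can be rerun over the whole repository whenever a new characterisation need arises.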

15 SCAlable Preservation Environment (SCAPE)
QA/Preservation challenges?
- Growing size of web archives
- Ephemeral and heterogeneous content
- Costly tools/actions
→ Develop scalable quality assurance tools
→ Enhance existing characterisation tools

16 Visual automated QA: Pagelizer
- Visual and structural comparison tool developed by the UPMC as part of SCAPE
- Trained and enhanced through a collaboration with IMF
- Wrapped by the IMF team to be used at large scale within its platform
→ Allows comparison of two web page snapshots
→ Provides a similarity score as an output

17 Visual automated QA: Pagelizer
- Tested on 13,000 pairs of URLs (Firefox & Opera)
- 75% of assessments correct
- Whole workflow runs in around 4 seconds per pair:
  - 2 seconds for the screenshot (depends on the page rendered)
  - 2 seconds for the comparison
- Processing time already halved since initial tests (MapReduce)
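
These figures give a rough idea of batch cost. At about 4 seconds per pair, the 13,000-pair test set takes roughly 14.4 hours when processed sequentially. The Python sketch below reproduces that arithmetic and shows the effect of parallel workers under an idealised linear-scaling assumption. The worker count of 10 is hypothetical and not taken from the slides.

```python
def batch_hours(pairs, seconds_per_pair, workers=1):
    """Wall-clock hours for a batch, assuming perfectly even work sharing."""
    return pairs * seconds_per_pair / workers / 3600


sequential = batch_hours(13_000, 4)             # one worker
parallel = batch_hours(13_000, 4, workers=10)   # hypothetical 10 workers
print(f"{sequential:.1f} h sequential, {parallel:.1f} h with 10 workers")
```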

18 Next steps
Improvements are to be made:
- Performance
- Robustness
- Correctness
New test in progress on a large-scale crawl:
- Results to be disseminated to the community through the SCAPE project and through on-site demos (contact IMF)!

19 Thank you. Any questions? http://internetmemory.org - http://archivethe.net florent.carpentier@internetmemory.org leila.medjkoune@internetmemory.org

