Presentation is loading. Please wait.

Presentation is loading. Please wait.

Web Archiving and Access Mike Smorul Joseph JaJa ADAPT Group University of Maryland, College Park.

Similar presentations


Presentation on theme: "Web Archiving and Access Mike Smorul Joseph JaJa ADAPT Group University of Maryland, College Park."— Presentation transcript:

1 Web Archiving and Access Mike Smorul Joseph JaJa ADAPT Group University of Maryland, College Park

2 Web Archive Storage and Search Management tools Storage infrastructure Indexing, searching, compression experiments 7/21/2010 NDIIPP Partners Meeting 2

3 Webarc Manager Develop a tool to help manage webarc collections Show statistics of a series of crawls Open API to easily query collection –List all copies of a page, etc 7/21/2010 NDIIPP Partners Meeting 3

4 Manager Components WarcManager (server) –REST-based access to index –Index of DAT/ARC entries –URL Searching, ARC browsing, Javascript Client Simple Web-Accessible Preservation (SWAP) –Web-accessible distributed storage –ARC page retrieval –1Gbps, 2200requests/s 7/21/2010 NDIIPP Partners Meeting 4

5 Manager Design 7/21/2010 NDIIPP Partners Meeting 5 Javascript Client Webarc Manager MySQL Storage Server REST JSON Page Request

6 Manager Screenshots 7/21/2010 NDIIPP Partners Meeting 6 Search Results URL Details WARC File Details

7 Storage Design SWAP – Simple, Web-Accessible Preservation Intelligent placement of files across multiple servers and disk partitions Simple HTTP access, PUT, GET, DELETE Use redirects to provide a uniform namespace Files organized into file groups –Each group resides on multiple partitions (slices) –Hash(file_path) % slices = partition No centralized catalog 7/21/2010 NDIIPP Partners Meeting 7

8 How it works 7/21/2010 NDIIPP Partners Meeting 8 Client Server 1Server 2 P1P2P4P3 GET ‘bag1/warc55.gz’ Calculate hash(bag1/warc55.gz) % 4 = 3 Return 302: Server 2 GET ‘bag1/warc55.gz’Return warc55.gz

9 Performance Good small file and large file performance –Over 2000 requests/s and 3000 redirects/ 7/21/2010 NDIIPP Partners Meeting 9

10 Time Machine for the Web Fast parallel indexer to handle large scale crawled web contents, coupled with a new compression scheme. Fast search of contents based on unstructured queries involving temporal specifications. Presentation of pertinent summary information in ranked order according to the temporal context. 7/21/2010 NDIIPP Partners Meeting 10

11 Parallel Index Construction A parallel and hybrid strategy of multi- core CPU and multi-GPU Processing Speed with a single node: 80MB/s 7/21/2010 NDIIPP Partners Meeting 11 Disk ARC File CPU Parser Disk Dictionary & Postings Lists Dictionary & Postings Lists Parallel CPU Parser CPU Indexer GPU Indexer Parallel CPU/GPU Indexer

12 Index Compression A combination of –integer compression scheme and –multi-versioned document posting w/ temporal information included Potential space savings of temporal information for frequent terms compared to full temporal information: 50%. Potential space saving from integer compression: 4B vs 1B for small integers 7/21/2010 NDIIPP Partners Meeting 12

13 Temporally Anchored Information Retrieval Given a temporally anchored query, return a ranked set of pages within a temporal context. A new approach that: –Substantially limits the search space –Generates efficiently temporally ranked pages Extensive experimental results. 7/21/2010 NDIIPP Partners Meeting 13

14 Additional Information http://adapt.umiacs.umd.edu –Papers, results, etc.. E-mail: msmorul@umiacs.umd.edu 7/21/2010 NDIIPP Partners Meeting 14


Download ppt "Web Archiving and Access Mike Smorul Joseph JaJa ADAPT Group University of Maryland, College Park."

Similar presentations


Ads by Google