Download presentation
Presentation is loading. Please wait.
Published byBeverly Cobb Modified over 8 years ago
1
Web Archiving and Access Mike Smorul Joseph JaJa ADAPT Group University of Maryland, College Park
2
Web Archive Storage and Search Management tools Storage infrastructure Indexing, searching, compression experiments 7/21/2010 NDIIPP Partners Meeting 2
3
Webarc Manager Develop a tool to help manage webarc collections Show statistics of a series of crawls Open API to easily query collection –List all copies of a page, etc 7/21/2010 NDIIPP Partners Meeting 3
4
Manager Components WarcManager (server) –REST-based access to index –Index of DAT/ARC entries –URL Searching, ARC browsing, Javascript Client Simple Web-Accessible Preservation (SWAP) –Web-accessible distributed storage –ARC page retrieval –1Gbps, 2200requests/s 7/21/2010 NDIIPP Partners Meeting 4
5
Manager Design 7/21/2010 NDIIPP Partners Meeting 5 Javascript Client Webarc Manager MySQL Storage Server REST JSON Page Request
6
Manager Screenshots 7/21/2010 NDIIPP Partners Meeting 6 Search Results URL Details WARC File Details
7
Storage Design SWAP – Simple, Web-Accessible Preservation Intelligent placement of files across multiple servers and disk partitions Simple HTTP access, PUT, GET, DELETE Use redirects to provide a uniform namespace Files organized into file groups –Each group resides on multiple partitions (slices) –Hash(file_path) % slices = partition No centralized catalog 7/21/2010 NDIIPP Partners Meeting 7
8
How it works 7/21/2010 NDIIPP Partners Meeting 8 Client Server 1Server 2 P1P2P4P3 GET ‘bag1/warc55.gz’ Calculate hash(bag1/warc55.gz) % 4 = 3 Return 302: Server 2 GET ‘bag1/warc55.gz’Return warc55.gz
9
Performance Good small file and large file performance –Over 2000 requests/s and 3000 redirects/ 7/21/2010 NDIIPP Partners Meeting 9
10
Time Machine for the Web Fast parallel indexer to handle large scale crawled web contents, coupled with a new compression scheme. Fast search of contents based on unstructured queries involving temporal specifications. Presentation of pertinent summary information in ranked order according to the temporal context. 7/21/2010 NDIIPP Partners Meeting 10
11
Parallel Index Construction A parallel and hybrid strategy of multi- core CPU and multi-GPU Processing Speed with a single node: 80MB/s 7/21/2010 NDIIPP Partners Meeting 11 Disk ARC File CPU Parser Disk Dictionary & Postings Lists Dictionary & Postings Lists Parallel CPU Parser CPU Indexer GPU Indexer Parallel CPU/GPU Indexer
12
Index Compression A combination of –integer compression scheme and –multi-versioned document posting w/ temporal information included Potential space savings of temporal information for frequent terms compared to full temporal information: 50%. Potential space saving from integer compression: 4B vs 1B for small integers 7/21/2010 NDIIPP Partners Meeting 12
13
Temporally Anchored Information Retrieval Given a temporally anchored query, return a ranked set of pages within a temporal context. A new approach that: –Substantially limits the search space –Generates efficiently temporally ranked pages Extensive experimental results. 7/21/2010 NDIIPP Partners Meeting 13
14
Additional Information http://adapt.umiacs.umd.edu –Papers, results, etc.. E-mail: msmorul@umiacs.umd.edu 7/21/2010 NDIIPP Partners Meeting 14
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.