2013.10.12 SLIDE 1DID Meeting - Montreal Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California,

Presentation transcript:

1 2013.10.12 DID Meeting - Montreal
Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Ray R. Larson, University of California, Berkeley
Paul Watry, University of Liverpool
Richard Marciano, University of North Carolina, Chapel Hill

2 Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Goals:
– Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
Data:
– Internet Archive Books Collection (with associated MARC where available), ~7.2 TB
– JSTOR, ~1 TB
– Context sources: SNAC Archival and Library Authority records
Tools:
– Cheshire3 – DL Search and Retrieval Framework
– iRODS – Policy-driven distributed data storage
– Amazon S3 storage and EC2 computing
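As a rough illustration of the "associate context" goal, the sketch below links already-extracted personal names to authority headings by comparing normalized name keys. It is only a minimal sketch under stated assumptions: the authority entries, identifiers, and the helper names `normalize`, `build_lookup`, and `link_names` are all hypothetical, and this is not the project's actual matching method or the SNAC record format.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Reduce a name to a lowercase, accent-free, order-independent key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep alphabetic tokens only (this drops dates and punctuation),
    # then sort so "Doe, Jane" and "Jane Doe" produce the same key.
    tokens = re.findall(r"[a-z]+", plain.lower())
    return " ".join(sorted(tokens))


# Toy authority file: identifiers and headings are invented for illustration.
AUTHORITY = {
    "auth-001": "Doe, Jane, 1900-1980",
    "auth-002": "Example, José",
}


def build_lookup(authority: dict) -> dict:
    """Map each normalized heading to its authority identifier."""
    return {normalize(heading): ident for ident, heading in authority.items()}


def link_names(extracted: list, lookup: dict) -> dict:
    """Return {extracted name: authority id, or None when nothing matches}."""
    return {name: lookup.get(normalize(name)) for name in extracted}


lookup = build_lookup(AUTHORITY)
links = link_names(["Jane Doe", "Jose Example", "Unknown Person"], lookup)
print(links)
```

Exact key matching like this will miss variant spellings and cannot separate different people who share a name; real authority linking would also need dates, places, or other context to disambiguate.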

3 Grid-Based Digital Libraries: Needs
– Large-scale distributed storage requirements and technologies
– Organizing distributed digital collections
– Shared Metadata – standards and requirements
– Managing distributed digital collections
– Security and access control
– Collection Replication and backup
– Distributed Information Retrieval support and algorithms

4 But…
Haven't Hadoop and its menagerie already solved everything?
– Yes – many tasks can now be done with great scale-up
– And No – most Hadoop solutions are batch-oriented and geared more towards summarization than towards information access
– Maybe – we are looking at replacing or supplementing the low-level data management with Hadoop or Spark tools

5 Grid/Cloud IR Issues
– Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e. speed)
– Very large-scale distribution of resources is (still) a challenge for sub-second retrieval
– Unlike most typical Grid/Cloud processes, IR is potentially less computing-intensive and more data-intensive
– In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
– We have developed the Cheshire3 system to evaluate and manage these issues. The Cheshire3 system is actually one component in a larger Grid-based environment
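The metasearch analogy and the precision/recall goal above can be sketched in a few lines: each node returns its own ranked list, the lists are merged into one global ranking, and the merged result is scored against known relevant documents. This is a minimal sketch, not Cheshire3 code; the function names, scores, and document ids are made up, and it assumes scores from different nodes are directly comparable, which is itself one of the known problems of distributed search.

```python
import heapq
from itertools import islice


def merge_ranked(node_results, k):
    """Merge per-node lists of (score, doc_id), each sorted by descending score.

    Returns the top-k hits of the combined ranking.
    """
    # heapq.merge expects ascending inputs, so order by negated score.
    merged = heapq.merge(*node_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


def precision_recall(retrieved_ids, relevant_ids):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved = set(retrieved_ids)
    hits = len(retrieved & set(relevant_ids))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical results from two index nodes.
node_a = [(0.9, "a1"), (0.4, "a2")]
node_b = [(0.7, "b1"), (0.2, "b2")]

top = merge_ranked([node_a, node_b], k=3)
precision, recall = precision_recall([doc for _, doc in top], {"a1", "b2"})
print(top, precision, recall)
```

Running the same evaluation against a single-node index of the same collection is one way to check that distribution has not changed retrieval performance.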

6 Cheshire3 Environment
[Diagram; one surviving label reads "or iRODS"]

7 Cheshire3 IR Overview
XML Information Retrieval Engine
– 3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool
– Uses Python for flexibility and extensibility, with C/C++-based libraries for processing speed
– Standards-based: XML, XSLT, CQL, SRW/U, Z39.50, OAI, to name a few
– Grid/Cloud capable: uses distributed configuration files, workflow definitions, and PVM or MPI to scale from one machine to thousands of parallel nodes
– Free and Open Source Software
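To make the standards list a little more concrete, here is a minimal sketch of how a client might build an SRU `searchRetrieve` request carrying a CQL query, using only the Python standard library. The base URL, the `dc.title` index, and the helper name `sru_search_url` are assumptions for illustration; the indexes a real server accepts depend on its configuration, and this is not taken from the Cheshire3 code base.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,              # CQL expression, URL-encoded below
        "startRecord": start_record,     # 1-based offset into the result set
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and index name.
url = sru_search_url(
    "http://localhost:8080/sru/example",
    'dc.title = "data mining"',
)
print(url)

# Round-trip check: the query string decodes back to the original CQL.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

The response to such a request is an XML document, which fits the XML/XSLT processing chain the slide describes.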

8 Cheshire3 Object Model

9 Current Version: iRODS and C3 on Amazon EC2 and S3
[Architecture diagram. Labels: Amazon S3, Bucket 1, Bucket 2, iRODS, Cache Resource (×2), Amazon EC2, Data Ingestion, Cheshire3, Indexing, Retrieval, iCAT, Rule Engine (×2), Data Presentation]

10 Sample demo


14 Summary
Indexing and IR work very well in the Grid/Cloud environment, with the expected scaling behavior for multiple processes
Still in progress:
– We are still collecting and processing the books collection from the Internet Archive
– We are still extracting place names, personal names, and corporate names, and linking them with reference sources (such as GeoNames, VIAF, and SNAC)

15 Thank you!
iRODS available via https://www.irods.org
Project web site: http://diggingintodata.web.unc.edu
Cheshire3 available via https://github.com/cheshire3
Special thanks to John Harrison (Liverpool), Chien-Yi Hou (UNC), Shreyas and Luis Aguilar (UCB)

