2013.10.12 SLIDE 1 DID Meeting - Montreal Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California,

Presentation transcript:

SLIDE 1 DID Meeting - Montreal
Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Ray R. Larson, University of California, Berkeley
Paul Watry, University of Liverpool
Richard Marciano, University of North Carolina, Chapel Hill

SLIDE 2 Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Goals:
–Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
Data:
–Internet Archive Books Collection (with associated MARC where available) ~7.2 TB
–JSTOR ~1 TB
–Context sources: SNAC Archival and Library Authority records
Tools:
–Cheshire3 – DL Search and Retrieval Framework
–iRODS – Policy-driven distributed data storage
–Amazon S3 storage and EC2 computing
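The extraction goal on this slide can be illustrated with a very small sketch. It is not the project's pipeline: it assumes a plain dictionary lookup in place of real NLP, and the `GAZETTEER` entries, the two regular expressions, and the function name `extract_entities` are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical mini-gazetteer standing in for authority records.
# The slides name SNAC as a context source; these entries are made up.
GAZETTEER = {
    "Montreal": "PLACE",
    "Liverpool": "PLACE",
    "Charles Darwin": "PERSON",
}

# Runs of capitalized words, e.g. "Charles Darwin". A capitalized word at
# the start of a sentence joins the run, so this crude pattern misses some names.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")
# Four-digit years from 1000 to 2099, a rough proxy for time expressions.
YEAR_RE = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def extract_entities(text):
    """Return (surface form, type, offset) tuples in order of appearance."""
    found = []
    for match in NAME_RE.finditer(text):
        label = GAZETTEER.get(match.group())
        if label is not None:
            found.append((match.group(), label, match.start()))
    for match in YEAR_RE.finditer(text):
        found.append((match.group(), "DATE", match.start()))
    return sorted(found, key=lambda item: item[2])
```

Calling `extract_entities("Charles Darwin visited Liverpool in 1831.")` yields one person, one place, and one date, each with its character offset, which is the kind of content-plus-context record the goals describe.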

SLIDE 3 Grid-Based Digital Libraries: Needs
Large-scale distributed storage requirements and technologies
Organizing distributed digital collections
–Shared Metadata – standards and requirements
Managing distributed digital collections
–Security and access control
–Collection Replication and backup
Distributed Information Retrieval support and algorithms

SLIDE 4 But… Haven't Hadoop and its menagerie already solved everything?
–Yes – many tasks can now be done with great scale-up
–And No – most Hadoop solutions are batch-oriented and geared more towards summarization than towards information access
–Maybe – we are looking at replacing or supplementing the low-level data management with Hadoop or Spark tools

SLIDE 5 Grid/Cloud IR Issues
Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
Very large-scale distribution of resources is (still) a challenge for sub-second retrieval
Unlike most other typical Grid/Cloud processes, IR is potentially less computing intensive and more data intensive
In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
We have developed the Cheshire3 system to evaluate and manage these issues. The Cheshire3 system is actually one component in a larger Grid-based environment
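The two concerns on this slide, retrieval quality and distributed search, can be sketched together. This is a minimal illustration, not Cheshire3 code: `precision_recall` and `merge_ranked` are hypothetical names, and the merge assumes that scores from different nodes are directly comparable, which is exactly the metasearch problem the slide alludes to.

```python
import heapq
from itertools import islice


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def merge_ranked(node_results, k):
    """Merge per-node (score, doc_id) lists into one top-k list of doc ids.

    Each input list must already be sorted by descending score.
    Assumption: scores are comparable across nodes; real distributed
    search usually has to normalize them first.
    """
    merged = heapq.merge(*node_results, key=lambda pair: pair[0], reverse=True)
    return [doc_id for _score, doc_id in islice(merged, k)]
```

Comparing `precision_recall` on the merged list against the same measure on a single-node run is one simple way to check that distribution has not changed retrieval performance.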

SLIDE 6 Cheshire3 Environment or iRODS

SLIDE 7 Cheshire3 IR Overview
XML Information Retrieval Engine
–3rd Generation of the UC Berkeley Cheshire system, as co-developed at the University of Liverpool
–Uses Python for flexibility and extensibility, but uses C/C++ based libraries for processing speed
–Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI to name a few
–Grid/Cloud capable. Uses distributed configuration files, workflow definitions and PVM or MPI to scale from one machine to thousands of parallel nodes
–Free and Open Source Software
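Because the engine is standards based, a client can in principle query it with a CQL expression carried in an SRU request. The sketch below only builds such a request URL using the standard SRU parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`). The endpoint in `BASE_URL`, the function name `sru_search_url`, and the `dc.title` index in the usage note are assumptions, since the slides do not give a server address or index names.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not give a server address.
BASE_URL = "http://localhost:8080/sru/demo"


def sru_search_url(cql_query, start=1, maximum=10, base_url=BASE_URL):
    """Build an SRU 1.1 searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,          # CQL text, percent-encoded by urlencode
        "startRecord": start,        # 1-based position of the first record
        "maximumRecords": maximum,   # page size requested from the server
    }
    return base_url + "?" + urlencode(params)
```

For example, `sru_search_url('dc.title any "data mining"', maximum=5)` produces a URL that any SRU-capable client could fetch; whether a given server exposes a `dc.title` index depends on its configuration.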

SLIDE 8 Cheshire3 Object Model

SLIDE 9 Current Version: iRODS and C3 on Amazon EC2 and S3
[Diagram labels: Amazon S3 (Bucket 1, Bucket 2), Amazon EC2, iRODS, iCAT, Rule Engine, Cache Resource, Data Ingestion, Cheshire3, Indexing, Retrieval, Data Presentation]

SLIDE 10 Sample demo

SLIDES 11-13 (no text content)

SLIDE 14 Summary
Indexing and IR work very well in the Grid/Cloud environment, with the expected scaling behavior for multiple processes
Still in progress:
–We are still collecting and processing the books collection from the Internet Archive
–We are still extracting place names, personal names, and corporate names and linking them with reference sources (such as GeoNames, VIAF, and SNAC)
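The split-index-merge structure behind indexing with multiple processes can be sketched as follows. This is an illustration under stated assumptions, not the Cheshire3 workflow: a thread pool stands in for the PVM or MPI worker nodes mentioned earlier, so it shows the shape of the computation rather than any real speedup, and the names `index_chunk`, `merge_indexes`, and `parallel_index` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_chunk(docs):
    """Build a partial inverted index (term -> set of doc ids) for one chunk."""
    partial = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial


def merge_indexes(partials):
    """Union the posting sets of all partial indexes."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


def parallel_index(docs, workers=2):
    """Split a list of (doc_id, text) pairs across workers, then merge."""
    chunks = [docs[i::workers] for i in range(workers)]
    # Threads are a portable stand-in for separate worker nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    return merge_indexes(partials)
```

Because each chunk is indexed independently and the merge is a simple union, the result does not depend on the number of workers, which is the property that lets this kind of job scale out.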

SLIDE 15 Thank you!
iRODS available via
Project web site
Available via
Special thanks to John Harrison (Liverpool), Chien-Yi Hou (UNC), Shreyas and Luis Aguilar (UCB)