2007.03.29 SLIDE 1 ISGC 2007 - Taipei, Taiwan. Grid-based Search and Data Mining Using Cheshire3. In collaboration with Robert Sanderson, University of Liverpool.

Presentation transcript:

SLIDE 1 (ISGC, Taipei, Taiwan)
Grid-based Search and Data Mining Using Cheshire3
In collaboration with Robert Sanderson, University of Liverpool, Department of Computer Science
Presented by Ray R. Larson, University of California, Berkeley, School of Information

SLIDE 2 Overview
Introduction
Context
Architecture
Grid
Text Mining
Data Mining
Applications
Future Plans and Applications
Questions?

SLIDE 3 Introduction
Cheshire History:
  – Developed at UC Berkeley originally
  – Solution for library data (C1), then SGML (C2), then XML
  – Monolithic applications for indexing and retrieval server in C + TCL scripting
Cheshire3:
  – Developed at Liverpool, plus Berkeley
  – XML, Unicode, Grid scalable: Standards based
  – Object Oriented Framework
  – Easy to develop and extend in Python

SLIDE 4 Introduction
Today:
  – Version
  – Mostly stable, but needs thorough QA and docs
  – Grid, NLP and Classification algorithms integrated
Near Future:
  – June: Version 1.0
      Further DM/TM integration, docs, unit tests, stability
  – December: Version 1.1
      Grid out-of-the-box, configuration GUI

SLIDE 5 Context
Environmental Requirements:
  – Very Large scale information systems
      Terabyte scale (Data Grid)
      Computationally expensive processes (Comp. Grid)
      Digital Preservation
      Analysis of data, not just retrieval (Data/Text Mining)
      Ease of Extensibility, Customizability (Python)
      Open Source
      Integrate not Re-implement
      "Web 2.0" – interactivity and dynamic interfaces

SLIDE 6 Context (architecture diagram; labels only)
Layers: Application Layer, Digital Library Layer, Data Grid Layer
User Interface: Web Browser, Multivalent, Dedicated Client
Protocol Handler: Apache + mod_python + Cheshire3
Process Management: Kepler, Cheshire3
Document Parsers: Multivalent, ...
Text Mining Tools (Natural Language Processing, Information Extraction): Tsujii Labs, ...
Data Mining Tools (Classification, Clustering): Orange, Weka, ...
Term Management: Termine, WordNet, ...
Information System: Cheshire3 (Search / Retrieve, Index / Store)
Data Grid: SRB, iRODS
User Interface: MySRB, PAWN
Process Management: Kepler, iRODS rules
Flow labels: Query, Results, Export, Parse, Store

SLIDE 7 Cheshire3 Object Model (diagram; object labels only)
UserStore, User, ConfigStore, Object, Database, Query, Record, Transformer, Records, Protocol Handler, Normaliser, IndexStore, Terms, Server, Document Group, Ingest Process, Documents, Index, RecordStore, Parser, Document, Query, ResultSet, DocumentStore, Document, PreParser, Extracter

SLIDE 8 Object Configuration
One XML 'record' per non-data object
Very simple base schema, with extensions as needed
Identifiers for objects are unique within a context (e.g., unique at the individual database level, but not necessarily between all databases)
Allows workflows to reference objects by identifier but act appropriately within different contexts
Allows multiple administrators to define objects without reference to each other
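A minimal sketch of context-scoped identifiers, assuming a hypothetical `Context` class (not the Cheshire3 API). The fallback from a database context to an enclosing server context is an assumption added for illustration. The same identifier resolves to a different object in each database, which is how a workflow could refer to "recordStore" and still act appropriately in each context.

```python
class Context:
    """Hypothetical configuration scope (a server or one database)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.objects = {}

    def register(self, identifier, obj):
        # Identifiers only need to be unique inside this one context.
        if identifier in self.objects:
            raise ValueError(f"{identifier!r} already defined in {self.name}")
        self.objects[identifier] = obj

    def get_object(self, identifier):
        # Look in the local context first, then in enclosing contexts (assumed).
        context = self
        while context is not None:
            if identifier in context.objects:
                return context.objects[identifier]
            context = context.parent
        raise KeyError(identifier)


server = Context("server")
server.register("defaultParser", "server-wide XML parser")

db_a = Context("db_a", parent=server)
db_b = Context("db_b", parent=server)
# Two administrators can both pick the identifier "recordStore" independently.
db_a.register("recordStore", "record store for database A")
db_b.register("recordStore", "record store for database B")

resolved = {
    "a_store": db_a.get_object("recordStore"),
    "b_store": db_b.get_object("recordStore"),
    "a_parser": db_a.get_object("defaultParser"),
}
```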

SLIDE 9 Grid
Focus on ingest, not discovery (yet)
Instantiate architecture on every node
Assign one node as master, rest as slaves. Master then divides the processing as appropriate.
Calls between slaves possible
Calls as small and simple as possible: (objectIdentifier, functionName, *arguments)
Typically: ('workflow-id', 'process', 'document-id')
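The call format above can be sketched as follows. This is an illustration only: a thread pool stands in for the grid nodes, and the `Workflow` class, `NODE_OBJECTS` table, `handle_call` and `master` functions are hypothetical names, not Cheshire3 code. The point is that a call carries only identifiers, and each node resolves them against its own copy of the architecture.

```python
from concurrent.futures import ThreadPoolExecutor


class Workflow:
    """Hypothetical stand-in for a configured workflow object."""

    def __init__(self, identifier):
        self.identifier = identifier

    def process(self, document_id):
        return f"{self.identifier} processed {document_id}"


# Every node instantiates the same architecture, so identifiers resolve locally.
NODE_OBJECTS = {"workflow-id": Workflow("workflow-id")}


def handle_call(call):
    """Run on a slave: unpack (objectIdentifier, functionName, *arguments)."""
    object_identifier, function_name, *arguments = call
    target = NODE_OBJECTS[object_identifier]
    return getattr(target, function_name)(*arguments)


def master(document_ids, workers=2):
    """Divide the work into small tuple calls and hand them to the workers."""
    calls = [("workflow-id", "process", doc_id) for doc_id in document_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the calls.
        return list(pool.map(handle_call, calls))


results = master(["doc-1", "doc-2", "doc-3"])
```

Because only short identifier strings cross the node boundary, the message size stays constant regardless of how large the document or workflow is.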

SLIDE 10 Grid Architecture (diagram; labels only)
Nodes: Master Task, Slave Task 1 ... Slave Task N, Data Grid, GPFS Temporary Storage
Message labels: (workflow, process, document), fetch document, document, extracted data

SLIDE 11 Grid Architecture - Phase 2 (diagram; labels only)
Nodes: Master Task, Slave Task 1 ... Slave Task N, Data Grid, GPFS Temporary Storage
Message labels: (index, load), store index, fetch extracted data

SLIDE 12 Workflow Objects
Written as XML within the configuration record.
Rewrites and compiles to Python code on object instantiation
Current instructions:
  – object
  – assign
  – fork
  – for-each
  – break/continue
  – try/except/raise
  – return
  – log (= send text to default logger object)
Yes, no if!
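A rough sketch of the "rewrite XML to Python, then compile" idea, covering only two of the instructions listed above (object and log). The XML attribute name `ref`, the `compile_workflow` function, and the generated `handler` signature are assumptions for illustration and should not be read as the real Cheshire3 workflow schema or compiler.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow: element and attribute names are illustrative only.
WORKFLOW_XML = """
<workflow>
  <object ref="upper"/>
  <log>"Loaded Record: " + input</log>
  <object ref="exclaim"/>
</workflow>
"""


def compile_workflow(xml_text, objects, log_lines):
    """Translate the XML steps into Python source and compile it to a function."""
    lines = ["def handler(objects, log, input):"]
    for step in ET.fromstring(xml_text):
        if step.tag == "object":
            # Call the referenced object on the current input, keep the result.
            lines.append(f"    input = objects[{step.get('ref')!r}](input)")
        elif step.tag == "log":
            # The element text is treated as a Python expression.
            lines.append(f"    log({step.text.strip()})")
        else:
            raise ValueError(f"unsupported instruction: {step.tag}")
    lines.append("    return input")
    namespace = {}
    # Only safe for trusted configuration, since the text is executed as code.
    exec(compile("\n".join(lines), "<workflow>", "exec"), namespace)
    handler = namespace["handler"]
    return lambda value: handler(objects, log_lines.append, value)


objects = {"upper": str.upper, "exclaim": lambda text: text + "!"}
log_lines = []
run = compile_workflow(WORKFLOW_XML, objects, log_lines)
output = run("record")
```

Compiling once at instantiation means the XML is not re-parsed for every document, which matters when the same workflow is applied to millions of records.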

SLIDE 13 Workflow example
workflow.SimpleWorkflow
Unparsable Record
"Loaded Record:" + input.id

SLIDE 14 Text Mining
Integration of Natural Language Processing tools
Including:
  – Part of Speech taggers (noun, verb, adjective, ...)
  – Phrase Extraction
  – Deep Parsing (subject, verb, object, preposition, ...)
  – Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi)
Planned: Information Extraction tools

SLIDE 15 Data Mining
Integration of toolkits is difficult unless they support sparse vectors as input: text is high dimensional, but most values are zero
Focus on automatic classification for predefined categories rather than clustering
Algorithms integrated/implemented:
  – Perceptron, Neural Network (pure Python)
  – Naïve Bayes (pure Python)
  – SVM (libsvm integrated with Python wrapper)
  – Classification Association Rule Mining (Java)
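To illustrate why sparse input matters, here is a small pure-Python perceptron that stores each document as a term-to-count mapping and only touches non-zero entries. It is a sketch under simple assumptions (whitespace tokenisation, labels of +1 and -1, invented toy documents), not the implementation integrated in Cheshire3.

```python
from collections import Counter


def to_sparse_vector(text):
    """Represent a document as {term: count}; absent terms are implicit zeroes."""
    return Counter(text.lower().split())


def dot(weights, vector):
    # Iterate only over the terms present in the document.
    return sum(weights.get(term, 0.0) * count for term, count in vector.items())


def train_perceptron(examples, epochs=10):
    """examples: list of (sparse_vector, label) pairs with label +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for vector, label in examples:
            if label * dot(weights, vector) <= 0:  # misclassified or undecided
                for term, count in vector.items():
                    weights[term] = weights.get(term, 0.0) + label * count
    return weights


def predict(weights, vector):
    return 1 if dot(weights, vector) > 0 else -1


training = [
    (to_sparse_vector("protein gene expression"), 1),
    (to_sparse_vector("gene sequence protein binding"), 1),
    (to_sparse_vector("library catalogue records"), -1),
    (to_sparse_vector("catalogue metadata library search"), -1),
]
weights = train_perceptron(training)
label = predict(weights, to_sparse_vector("protein binding"))
```

A dense representation would need one slot per vocabulary term for every document; the dictionary form keeps the cost proportional to the document length.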

SLIDE 16 Data Mining
Modelled as multi-stage PreParser object (training phase, prediction phase)
Plus need for AccumulatingDocumentFactory to merge document vectors together into a single output for training some algorithms (e.g., SVM)
Prediction phase attaches metadata (predicted class) to the document object, which can be stored in a DocumentStore
Document vectors are generated per index per document, so integrated NLP document normalization comes for free
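The two phases can be sketched with a small Naïve Bayes classifier. The `Document` and `NaiveBayesStage` classes and their method names are hypothetical and only loosely mirror the PreParser and AccumulatingDocumentFactory roles described above; the training texts are invented toy data.

```python
import math
from collections import Counter, defaultdict


class Document:
    """Hypothetical document object carrying text plus a metadata dictionary."""

    def __init__(self, text, metadata=None):
        self.text = text
        self.metadata = dict(metadata or {})


class NaiveBayesStage:
    """Sketch of a two-phase stage: accumulate for training, then predict."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocabulary = set()

    def accumulate(self, document):
        # Training phase: merge this document's vector into per-class totals.
        label = document.metadata["class"]
        vector = Counter(document.text.lower().split())
        self.term_counts[label].update(vector)
        self.class_counts[label] += 1
        self.vocabulary.update(vector)

    def process_document(self, document):
        # Prediction phase: attach the predicted class as metadata.
        vector = Counter(document.text.lower().split())
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(doc_count / total_docs)
            for term, count in vector.items():
                # Laplace (add-one) smoothing for unseen terms.
                p = (self.term_counts[label][term] + 1) / (total_terms + len(self.vocabulary))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        document.metadata["predicted_class"] = best_label
        return document


stage = NaiveBayesStage()
for text, label in [
    ("gene protein expression", "biology"),
    ("protein binding site", "biology"),
    ("library catalogue records", "libraries"),
    ("catalogue metadata search", "libraries"),
]:
    stage.accumulate(Document(text, {"class": label}))

result = stage.process_document(Document("protein expression data"))
```

Returning the same document object with extra metadata keeps the stage composable: the next step in a pipeline can store it or pass it on unchanged.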

SLIDE 17 Data Mining + Text Mining
Testing integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies.
Computational grid for distributing expensive NLP analysis
Results show better accuracy with fewer attributes:

SLIDE 18 Applications (1)
Automated Collection Strength Analysis
Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries.
The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records.
This involved very large scale processing of records to:
  – Deduplicate millions of records
  – Enrich deduplicated records against database of 45 million
  – Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

SLIDE 19 Applications (1)
Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems.
The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining.

SLIDE 20 Applications (2)
Assessing the Grade Level of NSDL Education Material
The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid.
Working with SDSC, we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL.
We determined the vocabulary-based grade level with the Flesch-Kincaid grade level assessment.
The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier).
This processing was done on the TeraGrid cluster at SDSC.

SLIDE 21 Applications (2)
The formula for the Flesch Reading Ease Score:
\[ \mathrm{FRES} = 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right) \]
The Flesch-Kincaid Grade Level Formula:
\[ \mathrm{FKGLF} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59 \]
The Domain was determined by:
  – Basing the domains used upon the AAAS Benchmarks
  – Taking in samples from each of the domain areas being examined and producing scored and ranked lists of vocabularies for each domain.
  – Passing each token in a document through a lookup function against this table and calculating tallies for the entire document.
  – Using these tallies to rank the order of likelihood of the document being about each topic; a statistical pass of the results returns only those topics that are above a certain threshold.
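A sketch of both steps in Python, using the standard published Flesch constants (206.835 for reading ease; 0.39, 11.8 and 15.59 for grade level). The syllable counter is a crude vowel-group heuristic, and the `rank_domains` function, the vocabulary table layout (token to per-domain score) and the example scores are all assumptions made for illustration; the project's own classifier is not reproduced here.

```python
import re
from collections import Counter


def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl


def rank_domains(tokens, domain_vocab, threshold):
    """Tally per-domain scores for each token, keep domains above the threshold."""
    tallies = Counter()
    for token in tokens:
        for domain, score in domain_vocab.get(token, {}).items():
            tallies[domain] += score
    # most_common() ranks domains from highest to lowest tally.
    return [(domain, total) for domain, total in tallies.most_common() if total > threshold]


# Hypothetical vocabulary table: token -> {domain: score}.
domain_vocab = {
    "cell": {"life science": 2.0},
    "energy": {"physical science": 1.5, "life science": 0.5},
    "force": {"physical science": 2.0},
}
ranked = rank_domains(["cell", "energy", "cell", "force"], domain_vocab, threshold=4.0)
```

A dictionary lookup per token keeps the domain pass linear in document length, which fits the "fast domain classifier" role described on the previous slide.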

SLIDE 22 Future Plans
IR Testing and Optimization
  – Work with the OCA Book collection as part of INEX 2007
  – TREC, CLEF, and INEX Benchmarking
Integration of Geographic Information Retrieval methods from Cheshire II
  – GIR Ranking and Gazetteer-based text retrieval using NLP methods
Pattern-driven text mining methods for extracting biographical information from texts
  – IMLS-funded “Bringing Lives to Light” project

SLIDE 23 Overview
Bringing Lives to Light
  – Focusing on the Who in Who, What, Where and When
  – Examining and extending various types of Biographical Markup
  – Mining biographical data from available information resources to fill our extended markup databases

SLIDE 24 WHEN, WHERE and WHO
Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

SLIDE 25
Place and time are broadly important across numerous tools and genres including, e.g., Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc.
Biographical dictionaries are also heavy on place and time:
Emanuel Goldberg: Born Moscow; PhD under Wilhelm Ostwald, Univ. of Leipzig; Director, Zeiss Ikon, Dresden; Moved to Palestine; Died Tel Aviv.
Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

SLIDE 26
A new form of biographical dictionary would link to all (diagram; labels only):
Texts; Numeric datasets; Thesaurus/Ontology; Gazetteers; captions; Maps/Geo Data; EVI; Time Period Directory; Time lines, Chronologies; Biographical Dictionary

SLIDE 27 “Lives” Projected Work
Develop XML markup for Biographical Events
Most likely to be adaptation and extension of existing biographical event markup
  – Example: EAC/EAD
Harvest biographical resources
  – Wikipedia, etc.
Integrate as next generation of current interface

SLIDE 28 EAC/EAD Biographical Note
1892, May 7: Born, Glencoe, Ill
A.B., Yale University, New Haven, Conn
Married Ada Hitchcock
Served in United States Army

SLIDE 29 Wikipedia data
Life events metadata
WHAT: Actions: prisoner
WHERE: Places: Holstein
WHEN: Times
WHO: People: Margaret Sambiria
Need external links
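One way to picture a life event record with the four facets is a small data class. The `LifeEvent` class and its `missing_facets` helper are hypothetical and do not represent the project's markup, which was planned as XML. The values come from the two slides above; facets that are blank in the transcript are left empty rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LifeEvent:
    """Hypothetical record for one biographical episode."""

    what: str = ""
    where: List[str] = field(default_factory=list)
    when: str = ""
    who: List[str] = field(default_factory=list)

    def missing_facets(self):
        """List the facets that still need to be filled or linked."""
        facets = {"what": self.what, "where": self.where, "when": self.when, "who": self.who}
        return [name for name, value in facets.items() if not value]


# Facets harvested from the Wikipedia example; WHEN has no value in the transcript.
wikipedia_event = LifeEvent(what="prisoner", where=["Holstein"], who=["Margaret Sambiria"])

# First entry of the EAC/EAD biographical note.
birth_event = LifeEvent(what="Born", where=["Glencoe, Ill"], when="1892, May 7")
```

Keeping WHERE and WHO as lists leaves room for the external links the slide asks for, for example gazetteer entries or other people's records.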

SLIDE 31 A Metadata Infrastructure (diagram; labels only)
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
INTERMEDIA INFRASTRUCTURE (columns: Facet, Authority Control, Special Display Tools):
WHO: Biographical Dictionary
WHEN: Time Period Directory; Timelines
WHERE: Gazetteer; Maps
WHAT: Thesaurus; Syndetic Structure
Learners, Dossiers

SLIDE 32 “Lives” Acknowledgements
Electronic Cultural Atlas Initiative project
This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries
Contact:

SLIDE 33 Thank you!
Available via