Presentation is loading. Please wait.

Presentation is loading. Please wait.

Global Digital Library in High-Energy Physics Jan Iwaszkiewicz (CERN) Digital Repositories – Linked Open Data – the possible Role of D4Science 16 th Dec.

Similar presentations


Presentation on theme: "Global Digital Library in High-Energy Physics Jan Iwaszkiewicz (CERN) Digital Repositories – Linked Open Data – the possible Role of D4Science 16 th Dec."— Presentation transcript:

1 Global Digital Library in High-Energy Physics Jan Iwaszkiewicz (CERN) Digital Repositories – Linked Open Data – the possible Role of D4Science 16 th Dec 2010

2 Contents INSPIRE Invenio software Demo INSPIRE use cases in D4Science II –OCR –Full-text indexing Current research: –Semantic Search –Author Disambiguation

3 3 run by (2007-)

4 4 HEP community community 20-30k active researchers publishing 10k articles/year large collaborations (up to 5000 members) very international (even small author groups) authors = readers rapid information exchange essential mailing of preprints since the 60s long OA tradition >90% of HEP journal articles on arXiv! dominance of community based information systems arXiv SPIRES

5 Dominance of community services 5 From 2007 survey of 2,000 physicists. Gentil-Beccot et al, Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J.Am.Soc.Inf.Sci.60:150-160,2009 arXiv:0804.2701

6 SPIRES (1974-) 6 network of databases HEP literature, conferences, institutions, experiments, hepnames, jobs SLAC – DESY – Fermilab Collaboration SPIRES-HEP Metadata for 850k objects, ~800 new records per week Preprints, journal articles, conference contributions, books, grey literature since 1974, web server since 1991 100k searches/day high data quality, manually curated, comprehensive coverage high acceptance, user involvement But: outdated technology from the 70s

7 7 Invenio (2002-) digital multimedia library system platform for CERN Document Server (CDS) powerful search engine Google-like speed for up to 5M records combined metadata, reference and fulltext search flexible metadata (MARCXML) personalization and collaborative features modular architecture Apache/Python/MySQL GNU General Public Licence - invenio-software.org ~30 instances worldwide

8 8 ingestion

9 9 dissemination

10 10 INSPIRE development 2007: feasibility study 2008: user-level functionalities data conversion citation analysis, search syntax, output formats… 2009: cataloguing functionalities metadata maintenance and enrichment tools 2010: workflow harvesting, cataloguing… April 2010: public beta version http://inspirebeta.net http://inspirebeta.net

11 11 Bibliographic Content SPIRES content (plus part of CDS): journal articles, conference proceedings, preprints, experimental notes, theses going beyond SPIRES: conference slides, multimedia, software, high-level research data… going back before 1974 more material from neighboring disciplines astrophysics, nuclear physics, mathematics… cited by core HEP articles

12 12 Rich Content Repository all freely accessible articles esp. endangered material access restricted articles hidden archive first agreements with Springer and APS historical material scanning of old preprint series beyond articles slides, multimedia, software, wikis… independent citable objects

13 13 INSPIRE features I Advanced search functionality Google-like freetext search Complex second-order searches Example: Find the most influential HEP core papers that cite the Hitchin article Generalized Calabi-Yau manifolds but dont cite any papers by Polchinski refersto:reportnumber:math/0209099 collection:core cited:100->9999 NOT refersto:author:Polchinski

14 14 INSPIRE features II detailed record pages abstract, keywords, references, citations, fulltext, figures various export formats comprehensive author pages affiliation history, coauthors, frequent keywords, article classification, citation summary citation analysis cited by, co-cited with, self-citations, citation history taxonomy based classification

15 Newest Features Plot extraction Plot caption indexing Full-text snippets 15

16 16 HEP taxonomy Hierarchical structure of all important HEP concepts Providing (e.g. dynamical symmetry breaking) : synonyms (dynamically broken) related terms (spontaneous symmetry breaking) broader/narrower (symmetry breaking) definitions subject areas (high-energy physics – theory)

17 17 Taxonomy applications fast automatic generation of keywords enabling e.g. prompt alerts manually curated afterwards automatic selection of HEP relevant articles helps minimizing manual selection improved search algorithm (planned) e.g. search for SUSY will also find supersymmetry narrow/broaden search user tagging (planned) improve Inspire generated classification improve taxonomy

18 18 Author identification INSPIRE author id Single Sign-on for INSPIRE and arXiv.org active participation in ORCID author disambiguation using e.g. lab ids, affiliation history, coauthors and more 22.000 INSPIRE-ids already assigned automatic association of papers with authors using info on affiliations, coauthors, research topics, from publishers G. Chen: 963 docs, 21 real authors, only 22 docs not assigned, 97.2% success rate INSPIRE-id part of author lists of large collaborations

19 19 Coming soon… personalization personal accounts, "bookshelves, display formats, e-mail alerts, RSS feeds collaborative tools, user groups claim my papers user tagging user submission paper centric (articles, supplementary material) and beyond

20 20 coming later innovative metrics semantic analysis recommender systems –combining citations, keywords, fulltext, usage pattern data... open API for 3rd party tools and searching object aggregation (OAI-ORE) OAIS standards for long-term document preservation

21 21 Partnerships Researchers - community user tagging, user submission improved correction interfaces feedback driving future developments information providers close alliance with arXiv data exchange with publishers/databases standardized author identities neighboring fields OAI – PMH (harvesting) and OpenSearch ADS (SAO/NASA Astrophysics Data System)

22 Demo 22

23 INSPIRE use cases in D4Science II OCR Fulltext indexing Bibliometric analysis Data mirroring 23

24 Process Execution Engine Process Execution Engine (PE2ng) is a system to manage the execution of programs in a distributed infrastructure PE2ng provides adaptors for gCube and gLite based GRID, as well as Condor and Hadoop 24

25 MapReduce in PE2ng D4Science introduced a support for MapReduce (Hadoop) jobs Advantages Fault tolerance - mechanisms of fault detection and automatic job resubmission in case of software or hardware errors. Load balancing - optimizing the completion time of processing the entire data set (finishing time of the last subjob) Combined Storage & CPU resources On demand execution mode (vs. batch queues) It perfectly matches the needs of I/O intensive INSPIRE jobs 25

26 Optical Character Recognition (OCR) Using open source Ocropus software, well suited for large scale batch processing Ocropus has three steps: document layout analysis, line recognition, character identification (different models) Input typically PDF files, OCR output in hOCR format (HTML), can be converted to PDF Several languages: EN, FR, DE... Automatic procedure developed (in python) for format conversions, text recognition, hOCR->pdf conversion, grid job submission The whole OCR process can be done as a grid job Scanning and OCRing typically done using commercial services and tools (e.g. in India), OCRing part more expensive=>can save real money. Interest from other institutions 26

27 Full-text Indexing About 1 million INSPIRE documents need full-text search combined with the Invenio meta data search Technologies used: Lucene, Hadoop, BibIndex (Invenio internal engine) D4Science services (Process Execution Engine, Hadoop) Integration with the powerful Invenio meta data search Future extension: semantic indexing using High Energy Physics taxonomy 27

28 Semantic Search (Roman Chyla) Lucene + Seman Seman is used for enriching the textual content with added, semantic features. There are 3 crucial components involved: –1) NLP engine (GATE) –2) workflow engine –3) translation engine https://svnweb.cern.ch/trac/rcarepo/wiki/InspireSemant icSearch#Fulltextsearchwithsemanticfeatures 28

29 Semantic search/indexing

30 Basic ideas NLP processing helps disambiguation HEP taxonomy bridges gaps between terms Translation into limited sets of codes facilitates operations over sets –For example, we can compute similarity of concepts X-Y using overlap of codes Text is indexing together with semantic tags Special processing applied both at indexing and at the query time 30

31 Author Disambiguation (Henning Weiler) BibAuthor is a framework around an algorithm meant to solve author ambiguity based solely on the meta data available for documents. It relies on the database of INSPIRE which is kept in the MARC21 format. 31

32 Algorithm Before any algorithm can run, one step of preparation is to be done: –Read all names from the Inspire database and store them in a new table The algorithm itself is divided into several steps: –1. Clustering--finding potentially related authors –2. Matching--create RA entities by pairwise comparison of VAs within a cluster –3. Post-matching comparison--identifying identical RA entities through cross-cluster comparison 32

33 Thank you! Questions? 33


Download ppt "Global Digital Library in High-Energy Physics Jan Iwaszkiewicz (CERN) Digital Repositories – Linked Open Data – the possible Role of D4Science 16 th Dec."

Similar presentations


Ads by Google