Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library



The Biodiversity Heritage Library
BHL is a consortium of natural history and botanical libraries and research institutions
An open access digital library for legacy biodiversity literature
An open data repository of taxonomic names and bibliographic information
An increasingly global effort – US/UK, Europe, Egypt, China, Africa

How much text are we talking?
Just hit the 40 million page mark
Tens of thousands of titles
110,000 volumes
Internet Archive is BHL's scanning partner
In conjunction with local scanning efforts

Issues we’ve faced
OCR is a *BIG* deal
A lot of literature is pre-1923
Expanding the range of material in BHL

OCR is a *BIG* deal
All book / literature digitization projects are affected, not just BHL
Especially problematic in BHL
– More than 50 languages represented in BHL
– Dates of publication from the 1400s to the 2000s
– Irregular typeface / typesetting
– Multiple languages on one page
Botanical descriptions in Latin

2007 Name Finding Study
>35% OCR error rate for names only

Top OCR errors
 1  Insert Space     8  n->v
 2  Omit Space       9  l->i
 3  e->c            10  r->i
 4  u->I            11  u->ii
 5  u->n            12  h->l
 6  i->l            13  h->ii
 7  c->e            14  e->o

35.16%: of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG
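One way to put a confusion list like this to work is to run it in reverse: for each character in an OCR'd name, ask what the printed character might have been, and keep only the variants that appear in a list of known names. The sketch below is illustrative only and is not the method used in the study. It covers just the single-character confusions from the slide, and the `lexicon` argument and the function name are hypothetical.

```python
from itertools import product

# Single-character confusions from the slide, as (printed, ocr_output).
# Multi-character cases (u->ii, h->ii) and space errors are left out
# to keep the sketch short.
CONFUSIONS = [
    ("e", "c"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("h", "l"), ("e", "o"),
]

# Map each OCR output character to the printed characters it may stand for.
REVERSE = {}
for printed, seen in CONFUSIONS:
    REVERSE.setdefault(seen, set()).add(printed)


def candidate_corrections(ocr_name, lexicon, max_candidates=10000):
    """Return lexicon entries reachable by undoing known OCR confusions.

    `lexicon` is a hypothetical set of verified names; `max_candidates`
    caps the search because the number of variants grows quickly.
    """
    options = [[ch] + sorted(REVERSE.get(ch, ())) for ch in ocr_name]
    found = []
    for count, combo in enumerate(product(*options)):
        if count >= max_candidates:
            break
        candidate = "".join(combo)
        if candidate in lexicon and candidate not in found:
            found.append(candidate)
    return found


known_names = {"Quercus alba", "Acer rubrum"}
print(candidate_corrections("Qnercus alba", known_names))
```

Because every variant is checked against the lexicon, the approach can only recover names that are already known; it does nothing for misread names that are absent from the list.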

Abbild ungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.

Older material
A great deal of material is pre-1923
Irregular fonts
– Blackletter
Multiple languages on the same page
– English text with Latin scientific names
Changes in geographic names
Changes in scientific names

*E.xvi � c � piteI von c. cXx.WptdvonfnrWmn bu � fbe;bcn.5 am cix bIa � S &3rn~ 41X a � m cv(f b1air � 'o � et ert oiensr � ; � ', : � hlrfc � c wa ff � 4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn � ciblatGteaM w ?ffoaifrn w4wmeu nu weib e, wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl � Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn " � trv W1Rt' ?Cm c blas waIwutr Ober � ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b � titbfof � r f eran m rs bra wlg auig4;f aer � m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt � run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W � e � &mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a\ u: � rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

Expanding scope
Manuscripts, field notebooks
– Mostly handwritten, often with drawings
Global expansion means dealing with non-Western script systems and a whole new set of OCR problems
– Arabic materials from Bibliotheca Alexandrina in Egypt

Images

Some current initiatives
Scientific name extraction
“Parts”
PDF Generator

Scientific Name Extraction
TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive
New collaboration with Global Names
– Improved algorithm, better precision & recall
– More data!
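The gap between "candidate name strings" and "verified names" can be illustrated with a two-stage sketch: a loose pattern proposes anything shaped like a Latin binomial, and a lookup against known genera keeps only the plausible ones. This is not the TaxonFinder algorithm, whose details are not given in the slides. The regular expression, the function names, and the `known_genera` set are all assumptions made for illustration.

```python
import re

# Loose shape of a binomial: a capitalised word followed by a lowercase word.
# This deliberately over-generates; verification happens in a second step.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def find_candidate_names(text):
    """Return every string in `text` that looks like 'Genus species'."""
    return [f"{genus} {species}" for genus, species in BINOMIAL.findall(text)]


def verify(candidates, known_genera):
    """Keep candidates whose first word is in a (hypothetical) genus list."""
    return [name for name in candidates if name.split()[0] in known_genera]


page_text = "The oak Quercus alba grows near Acer rubrum stands."
candidates = find_candidate_names(page_text)
verified = verify(candidates, {"Quercus", "Acer"})
print(candidates)
print(verified)
```

On real OCR text the first stage would also need to tolerate the character errors described earlier, which is one reason candidate counts run far higher than verified counts.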

Finding parts
Disambiguating and locating structural boundaries in the corpus
Done mainly by crowdsourced means
– Citebank
Greatly increases usability and semantic value of the dataset
Addressing is important: it makes data addressable and thus linkable

Articles in the BHL UI

Images

PDF Generator

What we’d like to do
Correcting OCR
Rekeying Tables of Contents
Researching candidate Scientific Names
Image identification & extraction
– Currently funded by NEH
Challenges framed as games

We need your help
“When in doubt, use humans.”
http://radar.oreilly.com/2012/07/data-jujitsu.html
Increase value of biodiversity domain through improved data integration
Many similarities between specimen labels and literature

Need deep intertwingling
Wider integration of biodiversity data
Normalization through controlled vocabularies and authorities
Linkages between
– Specimens
– Descriptions
– Articles
– Manuscripts
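A minimal sketch of what normalization through an authority buys: once differently spelled labels resolve to one shared identifier, specimens, articles, and manuscripts that mention the same taxon can be linked automatically. Every identifier, label, and function name below is hypothetical, and the tiny `AUTHORITY` table stands in for a real controlled vocabulary.

```python
from collections import defaultdict

# Hypothetical authority file: label variants mapped to one shared identifier.
AUTHORITY = {
    "quercus alba l.": "name:1",
    "quercus alba": "name:1",
    "white oak": "name:1",
}


def normalize(label):
    """Collapse case and whitespace, then look the label up in the authority."""
    return AUTHORITY.get(" ".join(label.lower().split()))


def link_records(records):
    """Group record identifiers under the authority identifier they share."""
    links = defaultdict(list)
    for record_id, label in records:
        key = normalize(label)
        if key is not None:  # labels unknown to the authority stay unlinked
            links[key].append(record_id)
    return dict(links)


records = [
    ("specimen:42", "Quercus alba L."),
    ("article:7", "Quercus  alba"),
    ("manuscript:3", "White oak"),
    ("article:9", "Unknown sp."),
]
print(link_records(records))
```

The record that fails to resolve is simply left out, which mirrors the practical point of the slide: linkage is only as good as the vocabularies and authorities behind it.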

To sum up
BHL is a massive dataset useful for multidisciplinary research
– Systematics
– Natural Language Processing
– Humanities
BHL is open
– Free to use
– Open access data for scholarly use & reuse
BHL has APIs and data exports to enable reuse
– BHL data can be incorporated into other virtual research environments

Get involved
Thanks!