Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library


1 Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library Twitter: @jmignault

2 The Biodiversity Heritage Library
- BHL is a consortium of natural history and botanical libraries and research institutions
- An open access digital library for legacy biodiversity literature
- An open data repository of taxonomic names and bibliographic information
- An increasingly global effort: US/UK, Europe, Egypt, China, Africa

3 How much text are we talking?
- Just hit the 40 million page mark
- Tens of thousands of titles
- 110,000 volumes
- Internet Archive is BHL's scanning partner
- In conjunction with local scanning efforts

4 Issues we’ve faced
- OCR is a *BIG* deal
- A lot of literature is pre-1923
- Expanding the range of material in BHL

5 OCR is a *BIG* deal
- All book / literature digitization projects affected, not just BHL
- Especially problematic in BHL
  - More than 50 languages represented in BHL
  - Dates of publication from the 1400s to the 2000s
  - Irregular typeface / typesetting
  - Multiple languages on one page
  - Botanical descriptions in Latin

6 2007 Name Finding Study
- >35% OCR error rate for names only
- Of the 3,003 names, 1,056 (35.16%) were incorrectly transcribed by OCR.
Top OCR errors:
Rank  Error          Rank  Error
1     Insert Space   8     n->v
2     Omit Space     9     l->i
3     e->c           10    r->i
4     u->I           11    u->ii
5     u->n           12    h->l
6     i->l           13    h->ii
7     c->e           14    e->o
Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
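The figures on this slide can be checked, and the confusion list put to use, with a short sketch. It is an illustration only, not the method used in the study: it assumes that "a->b" on the slide means the true character a was read by OCR as b, and the `known_names` set and the function names are hypothetical.

```python
# Confusion pairs from the slide, stored as (true_text, ocr_text).
# Assumption: "a->b" means the true character a was read as b.
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]


def error_rate(wrong, total):
    """Percentage of names that were transcribed incorrectly."""
    return 100.0 * wrong / total


def single_edit_candidates(ocr_text):
    """Undo exactly one known confusion at each place it could have occurred."""
    candidates = set()
    for true_chars, ocr_chars in CONFUSIONS:
        start = ocr_text.find(ocr_chars)
        while start != -1:
            end = start + len(ocr_chars)
            candidates.add(ocr_text[:start] + true_chars + ocr_text[end:])
            start = ocr_text.find(ocr_chars, start + 1)
    return candidates


def correct(ocr_text, known_names):
    """Return a known name if the OCR text matches it directly or via exactly
    one unambiguous single-confusion repair; otherwise return None."""
    if ocr_text in known_names:
        return ocr_text
    matches = sorted(single_edit_candidates(ocr_text) & known_names)
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    print(round(error_rate(1056, 3003), 2))
    print(correct("Qucrcus alba", {"Quercus alba"}))
```

Only single-character repairs are attempted here; space insertion and omission, the two most frequent errors on the slide, would need separate handling.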

7 Abbild ungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.

8 Older material
- Great deal of material is pre-1923
- Irregular fonts
  - blackletter
- Multiple languages on same page
  - English text with Latin scientific names
- Changes in geographic names
- Changes in scientific names

9 *E.xvi � c � piteI von c. cXx.WptdvonfnrWmn bu � fbe;bcn.5 am cix bIa � S &3rn~ 41X a � m cv(f b1air � 'o � et ert oiensr � ; � ', : � hlrfc � c wa ff � 4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn � ciblatGteaM w ?ffoaifrn w4wmeu nu weib e, wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl � Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn " � trv W1Rt' ?Cm c blas waIwutr Ober � ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b � titbfof � r f eran m rs bra wlg auig4;f aer � m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt � run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W � e � &mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a\ u: � rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

10 Expanding scope
- Manuscripts, field notebooks
  - mostly handwritten, often with drawings
- Global expansion means dealing with non-Western script systems and a whole new set of OCR problems
  - Arabic materials from Bibliotheca Alexandrina in Egypt

11 Images

12 Some current initiatives
- Scientific name extraction
- “Parts”
- PDF Generator

13 Scientific Name Extraction
- TaxonFinder algorithm in production since 2008
  - More than 100 million candidate name strings
  - More than 1.5 million unique, verified names
  - Available through UI, APIs, Data Exports & Internet Archive
- New collaboration with Global Names
  - Improved algorithm, better precision & recall
  - More data!
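The split between "candidate name strings" and "verified names" suggests a two-stage process: find strings that look like names, then check them against a list of known names. The sketch below illustrates that idea with a naive regular expression. It is not TaxonFinder or the Global Names algorithm, and the pattern, the function names, and the `known_names` set are all assumptions made for illustration.

```python
import re

# Naive pattern for a binomial: a capitalized word followed by a lowercase
# word of at least three letters. Real name finders are far more careful.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def find_candidates(text):
    """Stage 1: return every string shaped like 'Genus species'."""
    return [f"{genus} {species}" for genus, species in CANDIDATE.findall(text)]


def verify(candidates, known_names):
    """Stage 2: keep unique candidates that appear in a list of known names."""
    seen = set()
    verified = []
    for name in candidates:
        if name in known_names and name not in seen:
            seen.add(name)
            verified.append(name)
    return verified


if __name__ == "__main__":
    page = "The specimen of Quercus alba was found near Pinus strobus trees."
    known = {"Quercus alba", "Pinus strobus"}  # hypothetical authority list
    candidates = find_candidates(page)
    print(candidates)
    print(verify(candidates, known))
```

The first stage deliberately over-generates (it also picks up "The specimen"), which is why the candidate count on the slide is so much larger than the verified count.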

14 Finding parts
- Disambiguating and locating structural boundaries in the corpus
- Done mainly by crowdsourced means
  - Citebank
- Greatly increases usability and semantic value of the dataset
- Addressing is important: it makes data addressable and thus linkable

15 Articles in the BHL UI

16 Images

17 PDF Generator

18 What we’d like to do
Challenges framed as games: http://biodivlib.wikispaces.com/BHL+and+Gaming
- Correcting OCR
- Rekeying tables of contents
- Researching candidate scientific names
- Image identification & extraction
  - http://biodivlib.wikispaces.com/Art+of+Life
  - Currently funded by NEH

19 We need your help
- “When in doubt, use humans.” (@dpatil, http://radar.oreilly.com/2012/07/data-jujitsu.html)
- Increase value of biodiversity domain through improved data integration
- Many similarities between specimen labels and literature

20 Need deep intertwingling
- Wider integration of biodiversity data
- Normalization through controlled vocabularies and authorities
- Linkages between
  - Specimens
  - Descriptions
  - Articles
  - Manuscripts
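One way to picture these linkages is as subject-predicate-object statements that all point at a shared, normalized name identifier. The sketch below is purely illustrative: every identifier and predicate in it (the `ex:` prefix, `ex:identifiedAs`, `ex:mentions`, `ex:partOf`) is hypothetical and does not come from BHL or any real vocabulary.

```python
# Hypothetical statements linking different kinds of records to one name.
TRIPLES = [
    ("ex:specimen/1", "ex:identifiedAs", "ex:name/Quercus_alba"),
    ("ex:article/7", "ex:mentions", "ex:name/Quercus_alba"),
    ("ex:manuscript/3", "ex:mentions", "ex:name/Quercus_alba"),
    ("ex:article/7", "ex:partOf", "ex:volume/2"),
]


def linked_to(target, triples=TRIPLES):
    """Return every subject that has some statement pointing at the target."""
    return sorted({subj for subj, _pred, obj in triples if obj == target})


if __name__ == "__main__":
    # With a shared name identifier, a specimen, an article, and a
    # manuscript become discoverable from a single lookup.
    for record in linked_to("ex:name/Quercus_alba"):
        print(record)
```

The lookup only works because all three records use the same identifier for the name, which is the role the slide gives to controlled vocabularies and authorities.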

21 To sum up
- BHL is a massive dataset useful for multidisciplinary research
  - Systematics
  - Natural Language Processing
  - Humanities
- BHL is open
  - Free to use at http://biodiversitylibrary.org
  - Open access data for scholarly use & reuse
- BHL has APIs and data exports to enable reuse
  - BHL data can be incorporated into other virtual research environments

22 Get involved
- http://biodiversitylibrary.org
- http://biodivlib.wikispaces.com/Developer+Tools+and+API
- http://biodivlib.wikispaces.com/BHL+and+Gaming
Thanks!

