Natural Language Processing for LODLAM Presented at IGeLU 2014 by Corey A Harper 2014-09-16 A brief intro to machine learning & data science for Libraries.

1 Natural Language Processing for LODLAM Presented at IGeLU 2014 by Corey A Harper A brief intro to machine learning & data science for Libraries

2 Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014 Context. Narrative. Storytelling. The Library's story, and the Archives' story, but also…

3 Users' stories; Scholars' stories; adding context through recombinant metadata

4 Scholars' & Users' Stories – Tim Sherratt. Also:

5 Library Authority Data: “Include links to other URIs, so that they can discover more things.” Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.

6 Linked data is about context. Authorities provide context, and yet our controlled vocabs are nearly gone because the interfaces to them were broken.

7

8

9 The Death of Browse: Next-Gen Discovery Systems don't make use of Authority Control; “Browse” was/is broken as a UI design; rich data in Authorities, disconnected from narrative, context, search; richer “Authority” type data outside libraries... “Next Gen Next Gen Discovery…

10

11

12

13

14 FuzzyWuzzy – awesome library from SeatGeek: https://github.com/seatgeek/fuzzywuzzy
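To make the idea of fuzzy string matching concrete, here is a minimal sketch. It does not call FuzzyWuzzy itself; it is a stand-in built on Python's standard-library difflib, and the function names simple_ratio, normalize and token_sort_ratio are chosen here for illustration only. It shows why sorting tokens before comparing helps when a heading is stored in inverted form.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def normalize(s: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in s.lower())
    return " ".join(sorted(cleaned.split()))


def token_sort_ratio(a: str, b: str) -> int:
    """Compare two strings after normalization, so word order is ignored."""
    return simple_ratio(normalize(a), normalize(b))


if __name__ == "__main__":
    heading = "Harper, Corey A."   # inverted, catalog-style form
    variant = "Corey A Harper"     # natural-order form
    print(simple_ratio(heading, variant))      # penalized by word order
    print(token_sort_ratio(heading, variant))  # order-insensitive score
```

The plain ratio treats the inverted and natural-order forms as different strings, while the token-sorted comparison treats them as the same name.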

15 Slide courtesy of Doug Oard Univ. of Maryland

16 Tools - Natural Language Processing. DBPedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki; Zemanta; Open Calais; Open Refine; DataTXT: https://dandelion.eu/products/datatxt/; AlchemyAPI; FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

17

18

19 Where does this lead? We need new interfaces and new tools, for a new kind of cataloger, for knowledge organization experts.

20 Linked Jazz Back End

21 Primo PNX and Authorities: Indexing Cross References; New Browse Functionality; Authority Control from Aleph / Alma; What about non-MARC, or non-Aleph Data?; Matching Strings to Authorities

22 Enter Open Refine

23 Match strings to vocabularies…

24 Like LCNAF…

25 Or Wikipedia

26 Automated Authority Control?
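As a rough illustration of what matching strings to a vocabulary can look like in code, the sketch below reconciles a name against a tiny authority list and only accepts a match above a similarity threshold. Everything in it is an assumption made for the example: the identifiers (auth:0001 and so on) are hypothetical, the three entries are not real LCNAF records, and the 0.85 threshold is arbitrary. It uses Python's standard-library difflib rather than Open Refine's reconciliation service.

```python
from difflib import SequenceMatcher

# Hypothetical authority entries: identifier -> preferred label.
AUTHORITY = {
    "auth:0001": "Ellington, Duke",
    "auth:0002": "Fitzgerald, Ella",
    "auth:0003": "Monk, Thelonious",
}


def key(s):
    """Lowercase, strip punctuation and sort tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in s.lower())
    return " ".join(sorted(cleaned.split()))


def reconcile(name, authority, threshold=0.85):
    """Return (identifier, label, score) for the best match, or None.

    None means no candidate was similar enough to link automatically,
    so the string would be left for a person to review.
    """
    best = None
    for ident, label in authority.items():
        score = SequenceMatcher(None, key(name), key(label)).ratio()
        if best is None or score > best[2]:
            best = (ident, label, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


if __name__ == "__main__":
    for raw in ["Duke Ellington", "Thelonius Monk", "Miles Davis"]:
        print(raw, "->", reconcile(raw, AUTHORITY))
```

The threshold is the design decision that matters: set it too low and wrong headings get linked automatically, set it too high and misspelled variants fall through to manual review.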

27

28 Open Refine RDF Skeleton

29

30 Proposed System Architecture

31 Hydra Modeling & Architecture. Approaches to Provenance: Prov-O, Named Graphs, Named Datastreams. “n” nyucore “records”; same properties defined for each; keep data sources separate; merge for display in Blacklight & export to Primo.

32 Separate Metadata Datastreams. source_metadata, enrich_metadata: reload one or both without affecting other or native metadata. native_metadata: edited only through Hydra UI; partitioned from external sources.
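The separation described above can be sketched in a few lines. This is not Hydra or Fedora code; it is a plain-Python illustration in which the Record class, its reload and merged methods, and the sample property values are all hypothetical. Only the three datastream names come from the slide. The point it tries to show is that an external datastream can be replaced wholesale while native metadata stays untouched, and that a merged view for display can still carry each value's provenance.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # One dict per datastream; each maps a property name to a list of values.
    source_metadata: dict = field(default_factory=dict)
    enrich_metadata: dict = field(default_factory=dict)
    native_metadata: dict = field(default_factory=dict)

    def reload(self, stream: str, data: dict) -> None:
        """Replace one external datastream wholesale; native is off limits."""
        if stream not in ("source_metadata", "enrich_metadata"):
            raise ValueError("only external datastreams can be reloaded")
        setattr(self, stream, dict(data))

    def merged(self) -> dict:
        """Combine all datastreams, tagging each value with its origin."""
        out = {}
        for stream in ("native_metadata", "source_metadata", "enrich_metadata"):
            for prop, values in getattr(self, stream).items():
                for value in values:
                    out.setdefault(prop, []).append((value, stream))
        return out


if __name__ == "__main__":
    rec = Record(native_metadata={"title": ["Oral history interview"]})
    rec.reload("source_metadata", {"creator": ["Monk, Thelonious"]})
    rec.reload("enrich_metadata", {"subject": ["Jazz"]})
    # Reloading the source does not touch native or enrichment data.
    rec.reload("source_metadata", {"creator": ["Monk, Thelonious, 1917-1982"]})
    print(rec.merged())
```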

33 Metadata Provenance

34 Fedora Datastreams

35 Blacklight User Interface

36 Where does this lead? We need new interfaces and new tools, for a new kind of cataloger, for knowledge organization experts.

37 A Role for Ex Libris. Alma &/or Primo: Named Entity Recognition; Vocabulary Reconciliation; Provenance Management. Primo Central: Named Entity Recognition on Full Text; Auto Classification.

38 A bit louder... We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger, for knowledge organization experts.

39 Simplified Workflow Proposal

40 More Tools – At Programming Level. Open NLP: https://opennlp.apache.org/; Stanford Natural Language Toolkit; Python tools: scikit-learn, Pandas, NLTK, SciPy, NumPy (https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience)

41 More Data Science-ey Tools

42 Data Science Techniques: Feature Extraction / Feature Engineering; Predictive Modeling; Probabilistic Classification – Large Multi-Class Problems; Text Analytics: Vectorization, Bags & Sets of Words, TF/IDF, N-Grams, Sparse Matrices
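The text analytics terms above fit together in a short pipeline, sketched here in plain Python under a few assumptions: tokenization is a simple whitespace split, the TF/IDF weighting is one common variant (relative term frequency times the log of inverse document frequency), and each document's vector is stored as a dict of non-zero weights to stand in for a sparse matrix row. The function names and sample documents are invented for the example; libraries such as scikit-learn offer tuned versions of the same steps.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Split text into word n-grams of length 1 up to n_max."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, n_max=2):
    """Return one sparse row (term -> weight) per document."""
    bags = [Counter(ngrams(d, n_max)) for d in docs]  # bag of n-grams per doc
    df = Counter()
    for bag in bags:
        df.update(bag.keys())  # document frequency: count each term once per doc
    n_docs = len(docs)
    rows = []
    for bag in bags:
        total = sum(bag.values())
        rows.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
            if df[term] < n_docs  # terms in every doc weigh zero; omit them
        })
    return rows


if __name__ == "__main__":
    docs = ["jazz history archive", "jazz oral history", "jazz photographs"]
    for row in tfidf(docs):
        print(row)
```

A term that appears in every document, like "jazz" here, carries no weight, which is the property that makes TF/IDF useful for telling documents apart.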

43 Simple Example – Predict Yelp Star Ratings

44 Fitting a Model – Naïve Bayes
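The slide content for this example did not survive extraction, so what follows is not the presenter's code. It is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python, and the four "reviews" with their star labels are invented toy data rather than anything from the Yelp dataset. The NaiveBayes class and its method names are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def fit(self, texts, labels):
        """Count how often each word occurs under each label."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        scores = {}
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n)  # log prior
            for w in text.lower().split():
                if w in self.vocab:  # words never seen in training are skipped
                    # add-one (Laplace) smoothed log likelihood
                    score += math.log((self.word_counts[label][w] + 1) / (total + v))
            scores[label] = score
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented toy data: review text and a star rating.
    reviews = [
        "great food friendly staff",
        "terrible service cold food",
        "amazing great dinner",
        "awful cold rude staff",
    ]
    stars = [5, 1, 5, 1]
    model = NaiveBayes().fit(reviews, stars)
    print(model.predict("great friendly dinner"))
    print(model.predict("cold rude service"))
```

With only two labels this is a toy; the same counting approach extends to the large multi-class problems mentioned elsewhere in the talk, although sparse data per class then becomes the main difficulty.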

45 Data Science Venn Diagram

46

47 Where can we go from here? NER is just the beginning; Feature Engineering; Hiring Statisticians; Clustering & Classification; Vocabulary Pruning and Engineering; Manageable 10-20k Class Text Classification Problems; Domain Specific; Ex Libris’ Activity in this space

48 Thanks!

