Presentation is loading. Please wait.

Presentation is loading. Please wait.

Information Analysis at Scale: HathiTrust Research Center Beth Plale Director, Data to Insight Center Co-Director, HathiTrust Research Center November.

Similar presentations


Presentation on theme: "Information Analysis at Scale: HathiTrust Research Center Beth Plale Director, Data to Insight Center Co-Director, HathiTrust Research Center November."— Presentation transcript:

1 Information Analysis at Scale: HathiTrust Research Center Beth Plale Director, Data to Insight Center Co-Director, HathiTrust Research Center November 12, 2012

2 HathiTrust Objective: Contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge. Launched in October 2008 – University of Michigan – Indiana University Expanded to include content from – CIC Member Libraries – UC System Libraries – Internet Archive – Now includes more than 70 partner institutions 2

3 HathiTrust Research Center Mission: provision of computational access to comprehensive body of published works for scholarship and education. HTRC is founded as a joint venture between Indiana University and the University of Illinois Urbana-Champaign, aimed at solving the difficult challenges of increasing computational access to the public domain and copyrighted material in HathiTrust. 3

4 HTRC Long-term goals Support innovation in Cyberinfrastructure to deliver optimal access and use of HathiTrust corpus. Implement “Non-consumptive” research: a technical and intellectual challenge Identify and host existing data analysis, text mining and retrieval tools that are of interest to the community. Stimulate development of new analytical methods and tools. We hope that the scale of the HTRC will promote new levels of collaboration in tool development. 4

5 The HTRC Collection Public Domain Materials of the HathiTrust – 2,592,097 Volumes – Data Size 2.3 TB in raw OCR’d text 3.7 TB of managed OCR’d text 1.85 TB Solr Index – Monthly Updates And irregular data delete requests 5

6 Primary Functions To Support Data discovery – Searching – Creating and saving collections Service Discovery – Algorithms, analyses, and processes – Parameters and outputs Data storage and retrieval Computational resources APIs for controlled access to data and services 6

7 Experiment: Large Scale Data Analysis on XSEDE

8 Experimental Environment and Results Dataset 2,592,210 volumes, in total 2.1 TB, divided into 1024 partitions of 2GB each Computation platform XSEDE Blacklight, 1024-core of each 2.27 GHz, 8192 GB memory. Each core processes one partition Results Whole corpus word count finished in 1,454 seconds or minutes

9 Computation Time Distribution

10 Word Frequency Distribution

11 HTRC UnCamp Sept 2012

12 DEMO 12

13 13

14 Issues in Developing Research Collections 14

15 15

16 16

17 17

18 18

19 19

20 20

21 Issues in Developing Research Collections Search Methods – Known item search via fielded bibliographic data Title Author Standard number – Sparsely populated data – Full text access All words in OCR’d text All words in bibliographic data 21

22 OCR Errors in Public Domain Corpus Kinds of errors – Hyphens – Rules and Dictionaries – Pagination – Multiple language texts – Page headers, footers, marginalia – Non-text elements – Tables and highly structured document elements – Imbedded images – Scientific notations – Book design decorative elements 22

23 Automatic Text Correction 99,427 human expert verified rules were applied 256,000 volumes – 70 million pages – 270 words per page – 19 billion words – 125 million words required correction Total percentage of words needing correction was 0.65%. Probability that any volume has one or more errors was 84.9%. 23

24 Automatic Correction Details Average hit rate per page = Average hit rate per word = Total Bytes Read = 121 GB Total Execution Time = 3.80 hrs 24

25 Problem Examples 25 Hyphens: 1.Combining words 2.Crossing pages 3.Page headers

26 Issues with Text Original A D °GEO III C xxix 653 the faicl Commifiioners or Truftees or other Perfons as aforefaid to iffue their Preceptor Precepts to the Sheriff or Sheriffs or Bailiff or other proper Officer of the City Borough or County wherein fuch parochial or other Diftrift hall be fituate Rules-based Correction A D °GEO III C xxix 653 the faicl Commifiioners or #trustees# or other #persons# as #aforesaid# to #issue# their Preceptor Precepts to the Sheriff or Sheriffs or Bailiff or other proper Officer of the City Borough or County wherein #such# parochial or other #district# hall be #situate# 26

27 HTRC Research Sandbox A small subset of the collection – Bibliographic data (in MarcXML) – Images – OCR text 250,000 non-Google digitized public domain volumes (or the 35,000 IU digitized public domain collection) Image training sets – Clean page images and text – Bad page images and text 27

28 Metadata as Research Object An IU researcher is interested in publishing trends from 1700 – – Publication dates – Publishers – Publication location A research team from UWashington is interested in ontology developments – Subject headings – Terms in notes and other fields 28

29 Possible Metadata Research Using the full text and images to identify – Content metadata Topics, keywords, and subjects Genre Publication data Additional characteristics for content identification and author demographics – Structural component identification Chapters (first page of each chapter) Index Preface Table of Contents Genre as structure 29

30 Possible Text Research OCR correction algorithms – Era based rules Dictionary Hyphenation – Page awareness Page numbers Running heading Cross page hyphenation issues New OCR algorithms – Era based rules – Identification of non-text elements – Page awareness 30

31 Visit for more informationhttp://www.hathitrust.org/htrc 31


Download ppt "Information Analysis at Scale: HathiTrust Research Center Beth Plale Director, Data to Insight Center Co-Director, HathiTrust Research Center November."

Similar presentations


Ads by Google