The Illinois Digital Library Initiative: Processing and Access Issues for Full-Text Journals May 27, 1998 Pennsylvania State University William H. Mischo.

1 The Illinois Digital Library Initiative: Processing and Access Issues for Full-Text Journals May 27, 1998 Pennsylvania State University William H. Mischo University of Illinois at Urbana--Champaign Grainger Engineering Library Information Center

2 Overview Testbed Goals & Mission. Testbed Issues. Testbed Technologies. SGML Processing Methodology. Accomplishments. Transaction Log Analysis. Federation Tests & Distributed Repository Model. Future Foci. What We Have Learned. Questions.

3 “The Business of a University is Information…The Production and Dissemination of Information is the Work of the University.” Tom Everhart, President, California Institute of Technology

4 Digital Library Initiative Program Funded by National Science Foundation (NSF), DARPA, and NASA. Awarded grants to 6 universities (and partners), September 1994--August 1998. The 6: Illinois, Michigan, Stanford, Berkeley, Carnegie Mellon, Santa Barbara. Each project: $4 million over 4 years. Illinois: Testbed, Research, Evaluation, Web Software.

5 Scholarship, Publishing, Libraries Changing Paradigm: Authors, Publishers, Libraries, A & I Services. Scholarly Publishing Issues (We Pay Twice). Publisher Costs (85% for First Copy). Idea of Universities as Publishers. Users’ Information Seeking Behavior (personal collection, colleagues, e-mail, Web, Library). Archiving Issues (Depository idea: GB, Canada). Role of the Library (Function as well as Place).

6 Scholarship “The normal mode of scientific growth is exponential…(we are) entering a period of crisis marked by rapidly increasing concern over problems of manpower, literature, and expenditure that demand solution by reorganization.” –Derek de Solla Price, 1986. Year and Number of Journals: –1665: 1 –1932: 6,000 –1981: 96,000 –1996: 165,000. Avg. Price of U.S. Periodical rose 155%, 1986-96.

7 Testbed Goals & Objectives Construct Large-Scale, Multipublisher, SGML-Based Full-Text Testbed. Investigate Processing, Indexing, Normalization, Retrieval and Rendering. Study End-User Searching Behavior and Needs. Look at One-Stop-Shopping Retrieval Models (Integration of Services). Identify Models for Effective Retrieval in Electronic Full-Text Publishing Environment.

8 Testbed: 54 Journals, 39K Articles All items in SGML & 2/3 in PDF American Institute of Physics--APL, JAP, RSI –12,000 articles, 1995--, weekly updates. American Physical Society--PRL –8,800 articles, 1995--, weekly updates. ASCE Journals (25 titles) –5,000 articles, 1995--. IEE Proceedings and Electronics Letters –7,400 articles, 1993--. IEEE Computer Society (14 titles): 5,000 articles, 1996--.

9 Issues Toward the Holy Grail of Smart Document. Top Menu Integration and Cross-Resource Links. Searching over Full-Text of Journals vs. Abstract & Index Service Database. Full-Text Display (Mathematics Rendering: SGML, HTML, PDF, XML, Math ML, TeX). Web-Based Problems & Connectivity. Breadth and Depth of Collections. User Response.

10 Testbed Technologies Open Text (HPUX) Search Engine / LiveLink Web. Item Metadata for Normalization and Short-Entry Display. TCP/IP and HTTP for Full-Text, DCOM DLLs for A&I Links, Java Applets (Wordwheels). SGML rendering via Panorama. Custom Processing Programs on NT and Unix Platforms (Visual Basic, C++, Perl). Microsoft IIS (Web Retrieval, ASP for Links and Top Menu, Authentication w/ Bluestem).

12 Accomplishments (Overview) Distributed Repository Model (within Testbed & with AIP). Process & Retrieve from Multiple Publishers & Heterogeneous DTDs. Use of Aliasing (Normalization) for Cross-Repository Access from Single Client Search Argument. Item Metadata Definition. Dynamic Linking of Resources and Proxy A&I Service Access from / to Testbed. Focused User Studies.
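The aliasing (normalization) idea on this slide can be sketched as a lookup from one normalized field name to each publisher's DTD-specific element name, so that a single client search argument fans out into one query per repository. The Python sketch below is only an illustration under stated assumptions: the element names, the `ALIASES` table, and the function `expand_query` are hypothetical and are not the Testbed's actual DTD tags or code.

```python
# Hypothetical alias table: normalized field -> element name in each
# publisher's DTD. All element names here are invented for illustration.
ALIASES = {
    "title": {"AIP": "ATL", "APS": "TITLE", "IEE": "ARTTITLE"},
    "author": {"AIP": "AU", "APS": "AUTHOR", "IEE": "AUT"},
}


def expand_query(field, term, repositories):
    """Translate one normalized (field, term) search into per-repository
    (element, term) pairs, skipping repositories that lack the field."""
    if field not in ALIASES:
        raise KeyError(f"no alias defined for field {field!r}")
    queries = {}
    for repo in repositories:
        element = ALIASES[field].get(repo)
        if element is not None:
            queries[repo] = (element, term)
    return queries


# One search argument, three repositories with different markup.
for repo, query in expand_query("title", "superconductor", ["AIP", "APS", "IEE"]).items():
    print(repo, query)
```

Keeping the mapping in data rather than in code means a new publisher DTD would only need a new column in the table, which is one plausible reason to normalize at the metadata level.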

13 UIUC DLI Testbed Architectures Under Investigation [Architecture diagram. Labels: Repositories (SGML, PDF); Metadata; Indexes; Gateways; Clients; Testbed Links to A & I Services, Other Full Text; publishers IEE, IEEE CS, APS, ASCE, AIP; sites Urbana, New York; HTTP, Java, ASP, LiveLink; Authentication; Authorization.]

14 DeLIver Features Retrieval over Subset of Repositories. Forward (Citation) & Backward (Bibliography) Links to Testbed. Links to INSPEC, Compendex, Current Contents from Items & Bibliography. Ovid INSPEC/Compendex Proxy. Integration with Other Library Resources. Web-Kerberos Based Authentication. Capability of Digital Signing. User Transaction Logs.
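Forward (citation) links can in principle be derived from backward (bibliography) links by inverting them: if article A lists C in its bibliography, then C gains a forward link to A. The Python sketch below shows that inversion on made-up article IDs; the data and the function name `build_forward_links` are hypothetical, and the slide does not say how DeLIver actually computes its links.

```python
from collections import defaultdict

# Hypothetical data: each article ID maps to the IDs in its bibliography
# (its backward links).
bibliography = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
}


def build_forward_links(bibliography):
    """Invert backward (bibliography) links into forward (cited-by) links."""
    cited_by = defaultdict(list)
    for article, references in bibliography.items():
        for reference in references:
            cited_by[reference].append(article)
    return dict(cited_by)


forward = build_forward_links(bibliography)
print(forward.get("C", []))  # articles that cite C
```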

15 Toplevel Menu Transactions (Total 19738)

16 Transaction Logs (1) 4035 total end-user sessions (September through May). 3023 end-user sessions where searches were performed.
Top Bar               # Sessions   Total #
About DeLIver                427       536
Browse (all)                1585      2277
Browse Only                 1012
Help                         175       190
Quicktips                    189       245
Download Software           1001      1086
Other Resources              230       289

17 Transaction Logs (2) 4035 total end-user sessions (September through May). 3023 end-user sessions where searches were performed.
Search Fields               # Sessions   Total #
Keyword                           2083      6090
Abstract                           194       747
Article Title                      368       976
Article Author                     377       926
All Author                         185       468
Citations                           39        74
Body of Article                     76       336
Figure Caption                      26        60
Table Caption                        9        12
Journal Title                      218       530
Title, Headings, Caption           118       358

18 Transaction Logs (3) 4035 total end-user sessions (September thru May). 3023 end-user sessions where searches were performed. Average Length of Search: 727 seconds.
Searching Characteristics     # Sessions   Total #
Display Full-Text                   2079      4267
PDFs                                 842     10104
SGMs                                1516      4660
Extended Citation                    578      2212
Boolean Operators                    856      5773
ANDs                                 682
ORs                                  204       668
NOTs                                  30        79
KWIC Display                         389       780
Links to Inspec/Compendex            261       404
Multiword Search Arguments          1848      6134

19 Transaction Logs (4) 4035 end-user sessions (September thru May). 3023 end-user sessions where searches were performed.
Publisher Choices   # Sessions   Total #
All Publishers            2535      9185
AIP                         65       238
APS                         33        84
ASCE                        96       247
IEE                         38        98

20 Transaction Logs (5) 4035 end-user sessions (September thru May). 3023 end-user sessions where searches were performed. Points: Not much use of Help or Quicktips; a lot of Browsing but < 50% of search sessions; Not jumping to A&I Services from DeLIver; mostly Keyword Searching, also fair amount of Author, Article Title, Journal Title; much more Display Full-Text than Extended Citation (why?); 25% of sessions use Boolean operators; Multiword Search Arguments (complex terms, not single words) being entered; Linking to INSPEC/Compendex in 20% of sessions; predominantly All Publishers being searched.
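Some of the shares behind these points can be recomputed from the session counts on the preceding log slides. The Python sketch below does that arithmetic under two assumptions that should be read as such: the counts are taken from digits that are run together in the transcript (split so that sessions never exceed total events), and the choice of denominator (all 4035 sessions for browsing, the 3023 search sessions for the rest) is a guess at what the slide intends.

```python
# Session counts as read from the transaction log slides. The transcript
# fuses the digits, so these splits are an assumption, not confirmed data.
TOTAL_SESSIONS = 4035
SEARCH_SESSIONS = 3023

# label -> (sessions using the feature, assumed denominator)
counts = {
    "Browse (all)": (1585, TOTAL_SESSIONS),
    "Keyword search": (2083, SEARCH_SESSIONS),
    "Boolean operators": (856, SEARCH_SESSIONS),
    "All Publishers": (2535, SEARCH_SESSIONS),
}


def share(sessions, base):
    """Percentage of `base` sessions in which a feature was used."""
    return 100.0 * sessions / base


for label, (sessions, base) in counts.items():
    print(f"{label}: {share(sessions, base):.1f}% of {base} sessions")
```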

21 Testbed User Authentication Approach: –Authenticate Once per Session / Authorize per Use. Current Mechanism: –On 1st Request, User Referred to Bluestem Script –Upon Bluestem Authentication: Authorization Record Written to SQL Database; Cookie Set Which Points to that Record. Need to Fix Redirection Problem with MS IE. Need to Extend Outside Cookie-Setting Domain.
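The "authenticate once per session, authorize per use" pattern on this slide can be sketched as follows: a successful login writes an authorization record to a SQL table and hands back a cookie value that points to it, and every later request only looks that record up. The Python sketch below uses an in-memory SQLite database as a stand-in; the table layout, the function names `authenticate` and `authorize`, and the token format are all hypothetical, and it does not reproduce Bluestem or the Testbed's ASP code.

```python
import secrets
import sqlite3

# In-memory stand-in for the SQL database of authorization records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE authz (token TEXT PRIMARY KEY, user TEXT, collections TEXT)")


def authenticate(user, collections):
    """Run once per session, after the external login step has succeeded.
    Writes an authorization record and returns the cookie value pointing to it."""
    token = secrets.token_hex(16)
    db.execute("INSERT INTO authz VALUES (?, ?, ?)", (token, user, ",".join(collections)))
    db.commit()
    return token


def authorize(cookie_token, collection):
    """Run on every request: follow the cookie to its record and check access."""
    row = db.execute(
        "SELECT collections FROM authz WHERE token = ?", (cookie_token,)
    ).fetchone()
    return row is not None and collection in row[0].split(",")


cookie = authenticate("jdoe", ["AIP", "APS"])
print(authorize(cookie, "AIP"), authorize(cookie, "ASCE"), authorize("bogus", "AIP"))
```

Because the cookie carries only a pointer, the rights themselves stay on the server side, which fits the slide's note that the cookie "points to that record".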

22 Future Work Implementation of Distributed Repository Model. Expand Breadth of Testbed (Loading Locally and Linking to other Repositories). Use of Digital Object Identifiers and other Standards. Rendering via HTML 4.0 & CSS, XML & XSL. Adding Dynamic Retrieval Mechanisms (Wordwheels, Co-Occurrence Matrices). Expand Simultaneous Search Mechanisms. Expanded User Studies.
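A co-occurrence matrix, in its simplest form, counts how often two index terms appear in the same article, and can then suggest related terms for a search. The Python sketch below is a generic illustration on invented terms; the slide does not describe the mechanism planned for the Testbed, so the data and the functions `cooccurrence` and `related` are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical index terms per article.
docs = [
    ["laser", "diode", "gallium"],
    ["laser", "diode"],
    ["diode", "silicon"],
]


def cooccurrence(docs):
    """Count, for each unordered term pair, the articles containing both."""
    counts = Counter()
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


def related(term, counts):
    """Terms ranked by how often they co-occur with `term`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return scores.most_common()


counts = cooccurrence(docs)
print(related("laser", counts))
```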

23 SGML vs. HTML vs. XML SGML: –Supports Powerful Indexing, Search & Retrieval –But Client, Delivery, & Rendering Issues Remain HTML: –Ubiquitous; Rendering Has Become More Robust –But Remains Presentation Oriented, Less Semantic XML: –Subset Retains SGML Features of Primary Interest –But XML Is New, Untested, Under-Supported

24 Converting DLI Testbed to XML XML Differences from SGML: –No SHORTREF (Tag Minimization) –Tags Are Case Sensitive –Restrictions on Entities, Attributes, Link Mechanisms –Empty Tags Handled Differently. Math ML vs. ISO 12083 Math: –Math ML a Major Departure -- Adds Semantics –Focus on Java / ActiveX for Initial Deployment; Long-Term Success May Hinge on XSL / DSSSL. ‘Content-Markup’ requires XSL, Dynamic HTML functionality.
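Two of the differences listed here, case-sensitive tags and the handling of empty tags, can be checked with any XML parser. The Python sketch below uses the standard library's `xml.etree.ElementTree`; the element names (`article`, `title`, `fig`) and the helper `is_well_formed` are made up for illustration and are not taken from the Testbed's DTDs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Matching case in start and end tags: accepted.
print(is_well_formed("<article><title>Test</title></article>"))
# <Title> closed by </title>: rejected, because XML tags are case sensitive.
print(is_well_formed("<article><Title>Test</title></article>"))
# An empty element written SGML-style with no end tag: rejected.
print(is_well_formed("<article><fig></article>"))
# The XML empty-element form: accepted.
print(is_well_formed("<article><fig/></article>"))
```

An SGML system with a suitable DTD could accept the second and third documents, so a conversion of the kind this slide describes would need to normalize tag case and rewrite empty elements.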

25 CSS, XSL, DSSSL CSS1 & CSS2 Have Added: –Overlapping Glyphs, Absolute & Relative Positioning –Downloadable Fonts (Platform, Browser Variable) –Styling by Attributes, 2 Levels of Hierarchy. XSL, DSSSL, DSSSL-O: –XSL Uses XML Notation, Is Extensible (ECMAScript) –Allows More Extensive Manipulation in Formatting; Supports Re-arrangement, Navigator Frames, etc. –Not Yet Implemented in Production Browsers

26 What We Have Learned (1) Power of SGML for Indexing & Retrieval. Problems with rendering mathematics--SGML, TeX, HTML, XML, Math ML. Depth and breadth of collection (TULIP / Red Sage Syndrome; note use of Ovid client). Local Processing Implications. Metadata needs and robustness of Distributed Model.

27 What We Have Learned (2) Efficacy of Full-Text (stand-alone, integrated with A & I, part of TOC Service). The Idea of a Digital Library in the Digital Chaos--the role of the Gateway and Linking of Resources. Changing roles of Authors, Publishers, A & I Services, Libraries. These Technologies Will Transfer to the Web (CSS I & II, HTML 4.0, Dynamic HTML, XML).

