XSL, Swish-e and DjVu. Kevin Reiss, Rutgers-Newark School of Law Library. March 10th, 2004, TAG Meeting.


1 XSL, Swish-e and DjVu
Kevin Reiss, Rutgers-Newark School of Law Library
March 10th, 2004, TAG Meeting

2 Project Description: New Jersey Digital Legal Library
- URL: http://njlegallib.rutgers.edu
- Create a searchable and browsable repository of previously unavailable NJ legal information
- 3 collections:
  1) New Jersey Administrative Reports, 1979-1991
  2) New Jersey Executive Orders, 1941 – January 1990
  3) New Jersey Attorney General Opinions
- Collection 1 was scanned professionally by Princeton Imaging; 2 and 3 were done in house on a flatbed Minolta PS7000
- OCR quality is good in Collection 1, poor in 2 and 3
- Available in PDF and DjVu [with embedded OCR text]
- DjVu created with LizardTech Document Express 3.1
- PDF created with c42pdf [http://c42pdf.ffii.org/]

3 Project Requirements
- Use only open-source tools [other than for document creation]
- Need to provide full-text searching and searching within specific metadata fields
- Documents need to be indexed and retrieved as atomic units, rather than at the page level
- Solution:
  - Store the metadata and full text of each document in the same unit and find an indexing program that can index them both
- Ultimate solution:
  - Extract OCR text from DjVu files using djvutoxml
  - Use XSL to combine the djvutoxml output and the XML metadata in a single XHTML file
  - Use swish-e to index and search the XHTML file
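The "same unit" idea above can be sketched outside XSL as well. The following is a minimal illustration in Python, not the project's stylesheet: it writes metadata fields as HTML meta tags and the OCR text as paragraphs in one XHTML document. The function name `build_xhtml` and the field names `title` and `date` are hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_xhtml(metadata, full_text):
    """Return one XHTML document holding both metadata and full text.

    metadata  -- dict of field name -> value (hypothetical field names)
    full_text -- OCR text, paragraphs separated by blank lines
    """
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    title = ET.SubElement(head, "title")
    title.text = metadata.get("title", "")

    # Each metadata field becomes a <meta name="..." content="..."/> tag,
    # which an indexer can treat as a searchable field.
    for name, value in metadata.items():
        ET.SubElement(head, "meta", {"name": name, "content": value})

    # The full text goes into the body of the same file, so the document
    # is indexed and retrieved as a single unit.
    body = ET.SubElement(html, "body")
    for para in full_text.split("\n\n"):
        if para.strip():
            p = ET.SubElement(body, "p")
            p.text = para.strip()

    return ET.tostring(html, encoding="unicode")
```

Calling `build_xhtml({"title": "Executive Order 1", "date": "1941"}, text)` yields a single file in which a search on a metadata field and a full-text search both resolve to the same document.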

4 Swish-e Basics
- URL: http://www.swish-e.org/
- Simple Web Indexing System for Humans – Enhanced
- Full-text indexing program written in C, freely available
- Special indexing modes for XML and HTML documents; can index any plain-text format
- Uses standard open-source filtering tools to index ps, pdf, word, and ps.gz documents
- Can index both file systems and over HTTP
- Supports several stemming algorithms
- Supports Boolean searching
- Supports wildcard and phrase searching
- Indexing controlled by a standard configuration file format
- Uses libxml to parse XML and HTML documents

5 Why Choose Swish-e
- It can index and search HTML metatags
- It is fast: it indexes several thousand files in a few seconds
- Decent compression in the index: approx. 700 pages with metadata result in a 13.5 MB index
- SWISH::API, a Perl module for embedding swish-e in applications, is available
  - This module forms the basis of a fairly functional demo web-based search app that can be used to build your own search interface
- Easy to select the meta or XML tags you wish to index and return with search results using the “metatag” and “property” declarations in the swish-e config file
- Excellent documentation [http://www.swish-e.org/current/docs/]
- Under active development; version 2.4.2 was released just yesterday
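To make the “metatag” and “property” declarations concrete, here is a small Python sketch that writes out a Swish-e configuration file. The directive names (`IndexDir`, `IndexFile`, `IndexOnly`, `MetaNames`, `PropertyNames`) are the ones I recall from the Swish-e documentation and should be checked against the version in use; the helper `swish_config`, the paths, and the field names are hypothetical and do not come from the project's actual config.

```python
def swish_config(index_dir, index_file, meta_names, property_names):
    """Build the text of a minimal Swish-e config file.

    meta_names     -- fields that should be searchable (e.g. meta tag names)
    property_names -- fields that should be returned with search results
    All values passed in are illustrative assumptions.
    """
    lines = [
        f"IndexDir {index_dir}",        # where the XHTML files live
        f"IndexFile {index_file}",      # where the index is written
        "IndexOnly .html",              # only index the combined XHTML files
        "MetaNames " + " ".join(meta_names),
        "PropertyNames " + " ".join(property_names),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(swish_config("./xhtml", "./njlegal.index",
                       ["title", "date"], ["title", "date"]))
```

Listing a field under both declarations makes it searchable and also available for display in the result list, which is the pairing the slide describes.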

6 XSL Basics
- Extensible Stylesheet Language [http://www.w3.org/Style/XSL/]
- Really two W3C XML standards
  - XSLT: a transformation language for XML documents
  - XSL-FO: a language for specifying formatting semantics, much more powerful than CSS, generally used for print publications
- Written as well-formed XML
- Some predict it will take on SQL-like functionality for XML documents
- Based on the paradigm of functional programming
- XSLT transformations are executed using an XSLT processor
  - Many Java-based XSLT processors
  - I use libxml [http://xmlsoft.org/], a C-based library that includes an XML parser and XSLT processor
- Takes an XML document as input and transforms it into XML, HTML, or plain-text output
- The instructions for this transformation are located in XSLT stylesheets
- Transforms one tree into another

7 XSL Syntax Basics
- Stylesheets are constructed of a series of “templates” that match nodes or groups of nodes in an XML document
  - Example: main XSL stylesheet for djvu2xhtml conversion
- Groups of nodes are selected by writing XPath expressions
  - XPath is another W3C standard [http://www.w3.org/TR/xpath]
  - Purpose: “a language for addressing parts of an XML document”
- Has a number of familiar procedural constructs: looping, branching, named variables
  - Example of variables: parameter stylesheet for djvu2xhtml
- Some problems:
  - Can be slow for large documents [the whole document is loaded into memory]
  - Multiple input and output documents are clunky
  - String processing is problematic: no regexes, so complicated tasks typically require recursive structures
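The string-processing complaint is easier to see with an example. XSLT 1.0 has no loops over strings and no regexes, so splitting a string is usually written as a template that calls itself with the XPath functions `substring-before` and `substring-after`. The Python sketch below imitates that recursive shape; it is an illustration of the idiom, not code from the djvu2xhtml stylesheets, and the function names are my own.

```python
def substring_before(s, sep):
    """Mimic XPath substring-before(): text before the first sep, else ''."""
    i = s.find(sep)
    return s[:i] if i >= 0 else ""


def substring_after(s, sep):
    """Mimic XPath substring-after(): text after the first sep, else ''."""
    i = s.find(sep)
    return s[i + len(sep):] if i >= 0 else ""


def tokenize(text, sep):
    """Split text on sep the way a recursive XSLT 1.0 template would."""
    if not sep:
        raise ValueError("separator must be non-empty")
    # Base case: no separator left, emit the remainder (if any).
    if sep not in text:
        return [text] if text else []
    # Recursive case: emit the head, then recurse on the tail.
    head = substring_before(text, sep)
    rest = tokenize(substring_after(text, sep), sep)
    return ([head] if head else []) + rest
```

A one-line `str.split` in most languages becomes a named template with two parameters and a self-call in XSLT 1.0, which is why paragraph and list detection on OCR text gets complicated quickly.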

8 DjVuXML tools
- Part of DjVuLibre 3.5.12 or higher
- URL: http://djvu.sourceforge.net/doc/man/djvuxml.html
- Does djvused-like functions (annotations, highlighting) using XML syntax
- djvutoxml outputs an XML serialization of a DjVu document
  - Example output: results in very large files
  - The output reflects line, page and column information, and can vary quite a bit from document type to document type
  - Unrecognized OCR often results in Unicode errors, so use the provided xml2utf8 or xml2utf16 filters
  - Provides a set of coordinates for regions in a DjVu document that is contrary to what the plug-in understands
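As a rough picture of what the serialization looks like and how text can be pulled out of it, the Python sketch below parses a hand-written sample. The element names (`OBJECT`, `HIDDENTEXT`, `PAGECOLUMN`, `REGION`, `PARAGRAPH`, `LINE`, `WORD`) and the `coords` attribute follow the DjVuXML layout as I remember it; treat them, the sample values, and the helper `page_lines` as assumptions to verify against real djvutoxml output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed DjVuXML shape; not real tool output.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT data="file:sample.djvu" type="image/x.djvu" height="3300" width="2550">
<PARAM name="PAGE" value="p0001.djvu"/>
<HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="100,200,300,150">EXECUTIVE</WORD>
<WORD coords="320,200,480,150">ORDER</WORD>
</LINE>
<LINE>
<WORD coords="100,260,180,210">No.</WORD>
<WORD coords="200,260,240,210">1</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def page_lines(xml_text):
    """Return a list of pages, each a list of text lines.

    One OBJECT element is assumed to correspond to one page.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for obj in root.iter("OBJECT"):
        lines = [
            " ".join(word.text or "" for word in line.iter("WORD"))
            for line in obj.iter("LINE")
        ]
        pages.append(lines)
    return pages
```

Every word carries its own element and coordinate string, which explains why the output files are so large compared with the plain text they contain.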

9 Workflow
1) Prepare metadata in XML
   1) Available in a format based partly on Dublin Core and partly on in-house tags
   2) This was extracted from static HTML pages
2) Prepare customized metadata and display information for the documents to be transformed: example
   1) I use emacs nxml-mode for editing XML documents
3) Invoke DjVuXML commands
4) Transform documents to XHTML: example
5) Prepare Swish-e index
   1) Put meta and properties information in the config file
6) Prepare search interface
   1) Put meta and property information in the CGI interface config
   2) Put display-related meta and property information in the search template file
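Steps 3 to 5 lend themselves to a batch script. The sketch below only builds the command lines and does not run them; it assumes the `djvutoxml`, `xsltproc` and `swish-e` command-line tools are installed, and that `xsltproc` (the command-line front end to libxslt) is the XSLT processor in use, which the slides do not state. The function name and all file names are hypothetical.

```python
import os


def batch_commands(djvu_paths, stylesheet, swish_conf):
    """Return the commands for extraction, transformation and indexing.

    Each DjVu file gets an extraction step and a transformation step;
    the index is rebuilt once at the end. Nothing is executed here.
    """
    commands = []
    for djvu_path in djvu_paths:
        stem = os.path.splitext(djvu_path)[0]
        xml_path = stem + ".xml"
        html_path = stem + ".html"
        # Step 3: serialize the DjVu document (including OCR text) as XML.
        commands.append(["djvutoxml", djvu_path, xml_path])
        # Step 4: apply the stylesheet to produce the combined XHTML file.
        commands.append(["xsltproc", "-o", html_path, stylesheet, xml_path])
    # Step 5: build the Swish-e index from the config file.
    commands.append(["swish-e", "-c", swish_conf])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so that a failure in one step stops the batch before the index is rebuilt.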

10 Problems
- Use of disk space could develop into an issue
- XSLT transformations using the DjVuXML format are too slow to be used in any real-time processing; they must be done in batch
- Updating or adding metadata must be done by hand or by program; there is no data entry interface
- Swish-e has limited support for indexing XML attributes
- Swish-e can only index specific fields in XML documents that are defined as properties
- Enabling highlighting in DjVu documents will require solving the coordinate problem
- Complicated modifications to the search interface are time consuming and require you to learn one of the Perl HTML template mechanisms, like Template::Toolkit or HTML::Template

11 Future Directions
- Explore fully XML-aware indexing engines
  - Amberfish
  - eXist (example apps, based on XQuery)
  - Xindice
- Search interface improvements
  - Take the user directly to their keyword in the document
  - Dynamically generate the browsing pages for the collection based on information in the metadata files [currently static HTML]
- DjVuXSL stylesheet improvements
  - Work on string processing capabilities to recognize paragraphs and lists
  - Rework the use of the document() function to improve processing speed
  - Try XSLT 2.0 to see if the new string processing capabilities can help
  - Learn more about the structure of DjVu documents to make the stylesheets more reliable

12 Useful Links
- DjVuXSL
  - DjVuXSL Stylesheets Homepage
  - Guide to Dublin Core in HTML
- Swish-e
  - Current Swish-e Documentation
- XSL
  - XSL-List
  - Jeni Tennison's XSLT Pages
  - Book: XSLT Programmer's Reference
  - XSLT 1.0 Tutorial
  - XSLT 2.0 Introduction
  - XSLT 2.0 Implementation

