
1 Automating Content Analysis with Trang and Simple XSLT Scripts
  Bob DuCharme
  XML 2008, December 9, 2008
  Innodata Isogen (www.innodata-isogen.com): Improve the way you create, manage and distribute information

2 What We Do
  We help companies lower the cost of creating and managing information.

3 About me
  Solutions Architect, Innodata Isogen
  Weblog: http://www.snee.com/bobdc.blog
  Other writing: see http://www.snee.com/bob
  URLs referenced today: http://www.snee.com/xml/xml2008

4 Single source publishing and “editorial” XML
  [Diagram with labels: Input 1 to Input 5, Process A to Process F, Editorial Master (XML), Output 1 to Output 3]

5 Content analysis: why?
  You’ve “inherited” some content
  Convert it to your current editorial format
  Convert it to new output formats
  Efficient development of efficient conversion routines

6 Handy tool 1 before we get to the XML parts: sort
  colors.txt:
    red
    green
    blue
    green
    blue
    red
  $ sort colors.txt
    blue
    blue
    green
    green
    red
    red

7 Handy tool 2 before we get to the XML parts: uniq
  $ sort colors.txt | uniq -c
    2 blue
    2 green
    2 red
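The counting step that `sort | uniq -c` performs can also be sketched in a few lines of Python, which may help when the shell tools are not at hand. This is only an illustration, not part of the talk: the function name `count_lines` and the sample word list are hypothetical.

```python
from collections import Counter


def count_lines(lines):
    """Count identical lines, roughly like `sort | uniq -c`.

    Returns (count, line) pairs ordered by line text.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(n, value) for value, n in sorted(counts.items())]


# Hypothetical input, standing in for the lines of a text file.
words = ["pear", "apple", "pear", "fig", "apple", "pear"]
for n, value in count_lines(words):
    print(f"{n:7d} {value}")
```

The pipeline in the slides does the same job with no code at all, which is why the later slides lean on it so heavily.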

8 Sample data

9 trang
  From http://www.thaiopensource.com/relaxng/trang.html:
  Trang converts between different schema languages for XML. It supports the following languages:
    RELAX NG (XML syntax)
    RELAX NG compact syntax
    XML 1.0 DTDs
    W3C XML Schema
  A schema written in any of the supported schema languages can be converted into any of the other supported schema languages, except that W3C XML Schema is supported for output only, not for input. Trang can also infer a schema from one or more example XML documents.

10 trang
  Trang can also infer a schema from one or more example XML documents!!!!!

11 Analyzing content with trang
  Here is one document
  Here is another

12 Create RELAX NG versions of …
  Elsevier article DTD:
    trang art510.dtd art510.rng
  Combined sample content:
    trang issueContents.xml issueContents.rng
  Compare results:
    saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out

13 compareElsRNG.xsl (1 of 2)
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:r="http://relaxng.org/ns/structure/1.0">
    <xsl:variable name="schema" select="document('issueContents.rng')"/>

14 compareElsRNG.xsl (2 of 2)
  Yes:
  No:

15 compareElsRNG.xsl: some sample output
  No: tb:colspec
  No: tb:left-border
  No: tb:right-border
  No: tb:top-border
  Yes: aid
  Yes: article
  Yes: body
  Yes: ce:abstract
  Yes: ce:abstract-sec
  Yes: ce:acknowledgment
  Yes: ce:affiliation
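The body of compareElsRNG.xsl did not survive in this transcript, so the following is a rough Python sketch of the comparison it appears to perform, not the author's stylesheet: list every element declared in the DTD-derived schema and mark whether the schema inferred from the sample content also declares it. The function names and the tiny inline schemas are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"


def element_names(rng_text):
    """Collect the name of every r:element that has a name attribute."""
    root = ET.fromstring(rng_text)
    return {el.get("name") for el in root.iter(RNG + "element") if el.get("name")}


def compare(dtd_rng, sample_rng):
    """Return sorted 'Yes: name' / 'No: name' lines for each DTD element."""
    used = element_names(sample_rng)
    return sorted(
        ("Yes: " if name in used else "No: ") + name
        for name in element_names(dtd_rng)
    )


# Hypothetical miniature schemas standing in for art510.rng and issueContents.rng.
dtd_schema = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="x"><element name="aid"><text/></element></define>
  <define name="y"><element name="tb:colspec"><empty/></element></define>
</grammar>"""
sample_schema = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="x"><element name="aid"><text/></element></define>
</grammar>"""

for line in compare(dtd_schema, sample_schema):
    print(line)
```

Sorting the lines puts the "No" entries first, which is a quick way to see which declared elements the sample content never uses.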

16 Analyzing the XML itself
  Or SGML, after using James Clark’s sx:
    sx -f err.out -x lower myfile.sgm > myfile.xml

17 Counting elements: countElements.xsl

18 Using countElements.xsl to count elements
    saxon issueContents.xml countElements.xsl | sort | uniq -c | sort

19 Result of counting elements
  Start of list:
    1 ce:chem
    1 ce:displayed-quote
    1 ce:inline-figure
    1 ce:nomenclature
    1 ce:textbox
    1 ce:textbox-body
    1 ce:underline
    1 ce:vsp
    1 doc
    1 sb:e-host
    2 small-caps
    3 display
    3 formula
  End of list:
    5726 ce:cross-ref
    6916 entry
    7225 mml:mo
    7760 sb:maintitle
    7760 sb:title
    7929 ce:label
    8458 ce:hsp
    9326 mml:mi
    10331 mml:mrow
    12438 ce:italic
    16453 sb:author
    17082 ce:given-name
    17095 ce:surname
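For readers without an XSLT processor, the element count can be approximated with Python's standard library. This is a sketch rather than a port of countElements.xsl (whose body is not in the transcript); `count_elements` and the sample document are hypothetical. One difference to keep in mind: ElementTree reports namespaced names as `{namespace-uri}local`, not with the prefixes shown in the slide.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_text):
    """Return (count, element name) pairs, rarest elements first."""
    root = ET.fromstring(xml_text)
    counts = Counter(el.tag for el in root.iter())
    return sorted((n, tag) for tag, n in counts.items())


# Hypothetical sample document.
sample = (
    "<doc><para>one <italic>two</italic></para>"
    "<para><italic>three</italic> <italic>four</italic></para></doc>"
)
for n, tag in count_elements(sample):
    print(f"{n:7d} {tag}")
```

As in the slide, the rare elements at the top of the list are often the interesting ones: they point at edge cases that a conversion routine must either handle or deliberately ignore.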

20 Count element/parent combinations
  /

21 Some parent/child counts
    1 ce:displayed-quote/ce:simple-para
    59 ce:biography/ce:simple-para
    107 ce:legend/ce:simple-para
    115 ce:abstract-sec/ce:simple-para
    859 ce:caption/ce:simple-para
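A parent/child tally like the one above can be sketched in Python as follows. The stylesheet itself is missing from the transcript, so this is an assumption about its behavior: each element is reported as `parent/child`, and the root element, which has no parent element, is skipped. The function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_parent_child(xml_text):
    """Return (count, 'parent/child') pairs, rarest combinations first."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for parent in root.iter():
        # Iterating over an element yields its direct child elements.
        for child in parent:
            counts[f"{parent.tag}/{child.tag}"] += 1
    return sorted((n, pair) for pair, n in counts.items())


# Hypothetical sample document.
sample = (
    "<doc><caption><para/></caption>"
    "<legend><para/><para/></legend></doc>"
)
for n, pair in count_parent_child(sample):
    print(f"{n:7d} {pair}")
```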

22 countAttributes.xsl
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  /@

23 Counting the attributes: an excerpt
    1 ce:textbox/@id
    28 ce:enunciation/@id
    44 ce:table-footnote/@id
    50 ce:biography/@id
    79 ce:footnote/@id
    104 ce:correspondence/@id
    142 ce:table/@id
    175 ce:affiliation/@id
    180 ce:formula/@id
    182 ce:section/@id
    713 ce:figure/@id
    4224 ce:bib-reference/@id
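The same idea extends to attributes. The sketch below, again a Python stand-in and not the author's countAttributes.xsl, emits one `element/@attribute` key per attribute occurrence and counts them; `count_attributes` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_attributes(xml_text):
    """Return (count, 'element/@attribute') pairs, rarest first."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        f"{el.tag}/@{name}" for el in root.iter() for name in el.attrib
    )
    return sorted((n, key) for key, n in counts.items())


# Hypothetical sample document.
sample = '<doc><figure id="f1"/><figure id="f2"/><table id="t1" frame="all"/></doc>'
for n, key in count_attributes(sample):
    print(f"{n:7d} {key}")
```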

24 Count formula elements with/without ID values
  <xsl:stylesheet version="1.0"
      xmlns:ce="http://www.elsevier.com/xml/common/dtd"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  Yes:
  No:
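One way to read this slide is as a yes/no tally: for each `ce:formula` element, does it carry an `id` value? A minimal Python sketch under that assumption follows. The namespace URI is the one declared on the slide; the function name and sample document are hypothetical, and an empty `id=""` is counted as "No" here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace declared for the ce prefix on the slide.
CE = "{http://www.elsevier.com/xml/common/dtd}"


def formulas_by_id(xml_text):
    """Count ce:formula elements with ('Yes') and without ('No') an id value."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for formula in root.iter(CE + "formula"):
        counts["Yes" if formula.get("id") else "No"] += 1
    return counts


# Hypothetical sample document.
sample = (
    '<doc xmlns:ce="http://www.elsevier.com/xml/common/dtd">'
    '<ce:formula id="fd1"/><ce:formula/><ce:formula id="fd2"/></doc>'
)
for answer, n in sorted(formulas_by_id(sample).items()):
    print(f"{n:7d} {answer}")
```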

25 Find all values of a particular attribute
  <xsl:stylesheet version="1.0"
      xmlns:ce="http://www.elsevier.com/xml/common/dtd"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

26 Running OneAttValue.xsl
    xsltproc OneAttValue.xsl issueContents.xml | sort | uniq -c | sort
  Output ending like this:
    10 gr12
    11 gr11
    14 gr10
    17 fx1
    17 fx2
    18 gr9
    24 gr8
    37 gr7
    55 gr6
    67 gr5
    91 gr4
    99 gr3
    103 gr1
    103 gr2
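The transcript does not show which attribute OneAttValue.xsl targets, so the Python sketch below takes the attribute name as a parameter and tallies its values across all elements. The name `refid` in the example, like the function name and sample document, is purely an assumption for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def attribute_values(xml_text, attribute_name):
    """Return (count, value) pairs for one attribute name, rarest first."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(attribute_name)
        for el in root.iter()
        if el.get(attribute_name) is not None
    )
    return sorted((n, value) for value, n in counts.items())


# Hypothetical sample document and attribute name.
sample = '<doc><ref refid="gr1"/><ref refid="gr2"/><ref refid="gr1"/><p/></doc>'
for n, value in attribute_values(sample, "refid"):
    print(f"{n:7d} {value}")
```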

27 Output just the comments in a document
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

28 Output just the processing instructions in a document
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
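Comments and processing instructions can also be pulled out with Python's `xml.dom.minidom`, which keeps both node types when parsing. This is a sketch of the idea behind the two stylesheets above, not a reconstruction of them; the helper names and the sample document are hypothetical.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)


def comments(xml_text):
    """Return the text of every comment in the document."""
    doc = parseString(xml_text)
    return [n.data for n in walk(doc) if n.nodeType == Node.COMMENT_NODE]


def processing_instructions(xml_text):
    """Return (target, data) for every processing instruction."""
    doc = parseString(xml_text)
    return [
        (n.target, n.data)
        for n in walk(doc)
        if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    ]


# Hypothetical sample document.
sample = (
    '<?xml-stylesheet href="a.css"?>'
    "<doc><!-- check this --><p>text<?pagebreak after?></p></doc>"
)
print(comments(sample))
print(processing_instructions(sample))
```

Inherited content often hides editorial notes and typesetting hints in exactly these two places, so listing them early can save surprises during conversion.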

29 elAttList.xsl goal
  Go through rng schema
  For each element, output dtdname.dtd\telementName
  For each attribute, output dtdname.dtd\telementName\tattributeName

30 elAttList.xsl part 1 of 2
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:r="http://relaxng.org/ns/structure/1.0"
      version="1.0">
    <xsl:param name="dtdname">no dtdname parameter supplied

31 elAttList.xsl part 2 of 2
  <xsl:for-each select="r:attribute | r:optional/r:attribute">
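The goal stated for elAttList.xsl can be sketched in Python as below. Only fragments of the stylesheet survive in the transcript, so this follows the stated goal plus the one visible `select` expression: attributes are looked up as direct `r:attribute` children or inside a direct `r:optional` child. The function name and the miniature schema are hypothetical, and attributes reachable only through `r:ref` are not followed.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"


def el_att_list(rng_text, dtdname):
    """Return tab-separated rows: dtdname, element name, optional attribute name."""
    root = ET.fromstring(rng_text)
    rows = []
    for el in root.iter(RNG + "element"):
        name = el.get("name")
        rows.append(f"{dtdname}\t{name}")
        # Mirrors select="r:attribute | r:optional/r:attribute" from the slide.
        attributes = el.findall(RNG + "attribute")
        attributes += el.findall(f"{RNG}optional/{RNG}attribute")
        for att in attributes:
            rows.append(f"{dtdname}\t{name}\t{att.get('name')}")
    return rows


# Hypothetical miniature schema standing in for a trang-generated .rng file.
schema = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="article">
      <attribute name="docsubtype"/>
      <optional><attribute name="version"/></optional>
      <element name="title"><text/></element>
    </element>
  </start>
</grammar>"""

for row in el_att_list(schema, "art510.dtd"):
    print(row)
```

Tab-separated rows like these load directly into a spreadsheet, which makes it easy to line up several DTDs side by side.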

32 normalizeRNG.xsl
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:r="http://relaxng.org/ns/structure/1.0">
    <xsl:apply-templates select="//r:define[@name = $referent]" mode="copying"/>

33 Analyzing an SGML DTD
  Why? When migrating away from it
  RELAX NG and W3C XSD schemas are both XML, but an SGML DTD is not
  Using Earl Hood’s perlSGML DTD analysis tools

34 XML-based analysis of SGML DTD
  1. Run Earl Hood’s dtd2html utility
  2. Run tagsoup or HTML Tidy on output files
  3. Now you’ve got XML where you can pull out element information with XSLT

35 XML-based analysis of SGML DTD (revised)
  1. Tweak dtd2html to add elements
  2. Run Earl Hood’s dtd2html utility
  3. Run tagsoup or HTML Tidy on output files
  4. Now you’ve got XML where you can pull out element information with XSLT

36 Summary
  This is not an integrated report generator. It’s Legos.
  Pipelining data between existing tools, re-usable scripts, and quick hacks.
  Document your command lines, e.g.
    saxon temp1.xml temp3.xsl > temp1a.xml
  Clients like reports, especially in spreadsheets.

37 Thank you!
  Referenced resources: http://www.snee.com/xml/xml2008

