
1 Shortcuts to DDI Markup automation tools and methods that will save you time and effort – and are fun to use!

2 One rule of thumb:
Select and combine conversion strategies appropriate for your available sources and study documentation.

3 Different sources will give you different parts of the DDI.
[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database; Excel; delimited text; osiris, marc, …) supply parts of the DDI (study info, categories, question text, locations, frequencies, variable groups)]
Process the different sources and assemble/merge the results.

4 Most common study documentation “combo”:
- Statistical package file(s)
- Machine-readable codebook and/or questionnaire: ASCII or PDF
Example: ICPSR study no. 3356

5 Step one: Convert statistical package file(s)
Programs:
1) XCONVERT: free to download at http://sda.berkeley.edu:7502/ddi/tools/. Created by the SDA Project, CSM Program, UC Berkeley.
2) Nesstar’s Publisher: commercial software, see http://www.nesstar.com/products/publisher
3) Currently, SPSS and SAS are working on tools to export directly to DDI.

6 XCONVERT converts to DDI:
- SPSS dds (syntax)
- SAS dds (syntax)
- Stata dds (.do + dictionary files)
The resulting DDI markup has no frequencies. Frequencies can be obtained only by converting to DDI with the SDATOXML program, available to SDA subscribers.
XCONVERT does NOT convert dds for hierarchical data files.

7 Exercise 1: Convert Stata dds to DDI using XCONVERT
- Download XCONVERT to the same folder as your Stata dds files.
- In a text editor, combine the two Stata dds files (.do and .dct) into a single file and save it as .txt
- Conversion command (run in DOS): xconvert -x stata -i inputfile -o outputfile.xml
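The manual "combine in a text editor" step of this exercise can also be scripted. The following is a minimal Python sketch, not part of the original exercise; the function names are hypothetical, and the order of the two parts (.do first, then .dct) is an assumption. The command line is only assembled, using the options shown on the slide, and is not executed.

```python
from pathlib import Path


def combine_stata_dds(do_path, dct_path, out_path):
    """Concatenate the .do and .dct files into one text file for XCONVERT."""
    parts = [Path(do_path).read_text(), Path(dct_path).read_text()]
    # End every part with a newline so the two files do not run together.
    combined = "".join(p if p.endswith("\n") else p + "\n" for p in parts)
    Path(out_path).write_text(combined)
    return combined


def xconvert_command(input_file, output_file):
    """Build the command line from the slide as an argument list (not run here)."""
    return ["xconvert", "-x", "stata", "-i", str(input_file), "-o", str(output_file)]
```

The returned list can be passed to `subprocess.run` on a machine where XCONVERT is installed.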

8 Nesstar Publisher converts to DDI:
- SPSS dds (syntax) (merge in the raw data file to obtain frequencies)
- SPSS portable/export
- SPSS system
- Stata system (ex.: ICPSR study no. 3740)
DDI obtained from system/portable files will have no column locations.
Nesstar Publisher does NOT import dds for hierarchical data files.

9 Exercise 2: Convert SPSS dds to DDI using Nesstar Publisher
- Edit your SPSS dds: delete the comment box and any other additional lines down to the data list.
- Make your first line read: DATA LIST/
- Remove the “comment out” star from the missing values section.
- Save as .sps
- Import into Nesstar Publisher using the “File-import” command.
- Import the ASCII data file using the “Data-Insert data matrix from fixed format set” command.
- Export DDI, or save in .NSDstat format for further additions.

10 Step two: Convert PDF documentation to text format
Use xpdf (available from http://www.foolabs.com/)
Command type: pdftotext -layout infilename outfilename
(Preservation of formatting is NOT guaranteed)

11 Exercise 3: Convert PDF codebook to text format
- Download the xpdf program to the same folder as your PDF codebook.
- Conversion command (run in DOS): pdftotext -layout infilename outfilename
(The -layout option increases the chances of preserving the regular text format)
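When many codebooks need converting, the same command can be driven from a script. This is a small Python sketch under the assumption that the `pdftotext` executable from xpdf is on the PATH; the function names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext command line; -layout tries to keep the page layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_pdf(pdf_path, txt_path):
    """Run pdftotext, failing early if the program is not installed."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```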

12 Step three: Extract from codebook, and tag in DDI, question text and other relevant variable-level information
- For codebooks with a regular format, apply text-processing techniques, such as macros or regular expression syntax, in a powerful text editor like TextPad or emacs.
Make sure your final product is well-formed XML and DDI compliant!

13 TextPad
- TextPad is a powerful plain text editor available from http://www.textpad.com
- Cost: $16 - $29, depending on volume
- Includes regular expression search and replace and other nice features

14 Regular Expressions
Regular expressions are a special syntax that describes patterns in a text. They appear as strings of ordinary characters which take on special meanings.

15 Regular expressions: examples
.            any single character
[^a]         any character, except “a”
[0-9]        any single digit
[0-9]{2,4}   any sequence of min. 2 and max. 4 digits
^            beginning of line
$            end of line
+            one or more of the preceding character or expression
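The patterns listed above can be tried out in any regex engine. Below is a short sketch using Python's `re` module on made-up sample strings; Python's syntax is close to, but not identical with, TextPad's, so treat it as an illustration rather than a TextPad recipe.

```python
import re

text = "Q7 asked in 1998\nend"  # made-up two-line sample

any_char = re.findall(r"Q.", text)                # "." is any single character
not_a = re.findall(r"[^a]", "banana")             # any character except "a"
single_digits = re.findall(r"[0-9]", "Q7")        # any single digit
two_to_four = re.findall(r"[0-9]{2,4}", text)     # runs of 2 to 4 digits
line_starts = re.findall(r"^[a-zA-Z]", text, flags=re.MULTILINE)  # "^" is start of line
line_ends = re.findall(r"[a-z0-9]$", text, flags=re.MULTILINE)    # "$" is end of line
one_or_more = re.findall(r"[0-9]+", text)         # "+" is one or more of the preceding item
```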

16 Exercise 4: Create a DDI file containing variable names and question text
- Open your .txt codebook in TextPad
- Use regular expression-based commands and other TextPad special features to:
  - Delete unnecessary text
  - Attach DDI tags to the appropriate sections of text
  (Instructions provided)
- Insert codebook beginning- and end-tags to create valid DDI.
- Save as .xml
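The tagging steps of this exercise can be sketched in Python instead of TextPad macros. The codebook layout assumed here (one variable per line, a name such as V1 followed by the question text) is hypothetical, and so is the function name. The DDI element names `codeBook`, `dataDscr`, `var`, `qstn` and `qstnLit` are used, but the output is only a well-formed fragment: it is not validated against the DDI DTD, which expects further sections.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed layout: variable name (V + digits), whitespace, question text.
VAR_LINE = re.compile(r"^(V[0-9]+)\s+(.+)$")


def codebook_to_ddi(text):
    """Keep only variable lines and wrap each one in DDI var/qstn tags."""
    vars_xml = []
    for line in text.splitlines():
        match = VAR_LINE.match(line.strip())
        if match is None:
            continue  # drop titles, page numbers and other unneeded text
        name, question = match.groups()
        vars_xml.append(
            "    <var name=%s>\n"
            "      <qstn><qstnLit>%s</qstnLit></qstn>\n"
            "    </var>" % (quoteattr(name), escape(question))
        )
    # Add the codebook beginning- and end-tags around the variables.
    return (
        "<codeBook>\n  <dataDscr>\n"
        + "\n".join(vars_xml)
        + "\n  </dataDscr>\n</codeBook>\n"
    )
```

Escaping `&` and `<` in the question text is what keeps the result well-formed when the codebook contains those characters.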

17 Step three (continued): Create variable groups
- Use Nesstar Publisher’s “Variable Groups” feature.
OR,
- Use SDA’s VARGROUP script to produce DDI markup.
(A word of warning! If using SDA’s VARGROUP, replace commas with spaces in the DDI output file, as commas are not allowed in attributes!)

18 Exercise 5: Create DDI markup for variable groups using SDA’s VARGROUP
- Open your .txt codebook in TextPad.
- Use regular expression-based commands and other special TextPad features to produce the input file for the VARGROUP script (instructions provided).
- Download the VARGROUP program to the same folder as your input file.
- Conversion command (run in DOS): vargroup -i inputfile
- In TextPad, replace commas with spaces in the DDI output file.
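The final comma-to-space replacement can be limited to the attribute values so that commas in labels survive. This Python sketch assumes the commas to fix sit in the `var="..."` attribute of the variable group elements; that is an assumption about VARGROUP's output, and the function name is hypothetical.

```python
import re

# Matches var="..." attributes; "<var name=...>" tags are not affected.
VAR_ATTR = re.compile(r'(\bvar=")([^"]*)(")')


def commas_to_spaces(ddi_text):
    """Turn comma-separated variable lists inside var="..." into space-separated lists."""
    def fix(match):
        ids = match.group(2).replace(",", " ").split()
        return match.group(1) + " ".join(ids) + match.group(3)
    return VAR_ATTR.sub(fix, ddi_text)
```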

19 Step four: Merge or combine DDI files to generate an information-rich codebook
- To combine (attach new sections): use XML- or text-editing software to insert new sections in the appropriate sequence (but beware of producing invalid documents!).
- To merge: use Nesstar Publisher or XSLT.

20 Nesstar Publisher’s merge feature
Will merge in:
- Entire sections of the DDI.
- Individual fields within each section.
Using this feature will enable you to write in newly added tags or overlay tags that already have content.
Key for merges is

21 Exercise 6: Use Nesstar Publisher to merge DDI files documenting different parts of the same study
- In Nesstar Publisher, open the saved .NSDstat file (reimporting the DDI will result in loss of frequencies).
- Use the “Documentation – Import from DDI” command to merge in the Question Text file.
- Use the same command to merge in an ICPSR catalog record covering Sections 2 (Study Description) and 3 (File Description) of the DDI.
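For readers without Nesstar Publisher, the idea of merging question text into an existing DDI file can be sketched with Python's standard library. This is not how Nesstar does it: matching variables on the `name` attribute is an assumption, the function name is hypothetical, and only the `qstn` element is merged. New `qstn` elements are placed after any `location` and `labl` children, which only approximates the DTD's element order.

```python
import copy
import xml.etree.ElementTree as ET


def merge_question_text(base_xml, questions_xml):
    """Copy <qstn> elements from one DDI document into another, matched on var name."""
    base = ET.fromstring(base_xml)
    extra = ET.fromstring(questions_xml)
    qstn_by_name = {
        var.get("name"): var.find("qstn")
        for var in extra.iter("var")
        if var.find("qstn") is not None
    }
    for var in base.iter("var"):
        new_qstn = qstn_by_name.get(var.get("name"))
        if new_qstn is None:
            continue
        old = var.find("qstn")
        if old is not None:
            var.remove(old)  # overlay a tag that already has content
        position = 0
        for index, child in enumerate(list(var)):
            if child.tag in ("location", "labl"):
                position = index + 1
        var.insert(position, copy.deepcopy(new_qstn))
    return ET.tostring(base, encoding="unicode")
```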

22 Review
- Regular expressions are very powerful and worth your time to learn
- XCONVERT can extract DDI variables and categories (but not frequencies)
- Nesstar can work directly with statistical data files to extract frequencies
- Nesstar can merge DDI information from different sources.

23 Automation

24 Automation
Approaches to Automation
- PROGRAMMING: Use a programming language such as Java, C#, VB, Perl, PHP, ColdFusion
- COCOON: Use an XML publishing framework such as Apache Cocoon (PLUG)
- UNIX: Adapt/reuse existing scripts using UNIX (Linux, Mac OS X)-based tools

25 Automation Recommendations
- Use UNIX to glue existing scripts together
- Use XSLT
- Use Cocoon or scripts to process XML
- Code new functionality as necessary, with command-line wrappers
[Diagram labels: IN, Scripts, UNIX, XSLT, Cocoon, DDI, XSLT, OUT]

26 Survey of DDI and XML Tools
Tool | Platforms | Sources | Results | License*
SDA’s XCONVERT, VARGROUP | UNIX, Windows | Stat package files (SPSS, SAS, Stata) | DDI (no frequencies) | free
Oracle XML Developer’s Kit (XDK) | UNIX, Windows | XML, XSLT | any | free
DDI_DTD.cif | Blaise | Blaise | “xml” | free
MSXML 4.0 | Windows | XML, XSLT | any | free
GESIS spssoms2ddi | XSLT | SPSS OMS XML | DDI | GNU
HTML Tidy | UNIX, Windows | Badly formed html | xhtml | open
* Check licensing terms

27 How do I use XSLT stylesheets?
- Browser (IE and Mozilla)
- Programming language (many libraries and APIs)
- Server (Xalan, Xerces, xt, Saxon)
- Apache Cocoon
- Command line (Oracle XDK or MSXML 4.0)
- Windows shortcut

28 Automation Exercise 1
- Apply an XSLT stylesheet in various ways
- Open the folder “xslt” and follow the instructions in “oraxsl lesson.txt”

29 XSLT advantages
- When the source is XML, XSLT can output to XML, text, pdf, even jpeg
- This might be done directly, or possibly via an intermediate format and a conversion tool/library such as html2pdf or fop
- Cocoon has a large number of such libraries built in
- XSLT stylesheets can be reused in Java, C#, Perl, PHP, ColdFusion.
- XSLT stylesheets are easier to modify if the XML changes or needs to be parsed differently

30 XSLT drawbacks
- Not in the typical skillset: functional programming is different from OO and procedural programming
- Memory hog: the entire document is loaded into memory and expanded
  - Doc size/content ratio = 20+
  - Solutions:
    - Preprocess using a streaming parser
    - Allot more memory: java -Xms -Xmx
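The "preprocess using a streaming parser" solution can be illustrated with `iterparse` from Python's standard library, which hands over elements as they are read instead of building the whole tree first. The sketch below only lists variable names from a DDI-like document; the function name and the sample input are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_var_names(source):
    """Yield the name of each <var> as soon as its end tag has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "var":
            yield elem.get("name")
            elem.clear()  # discard the element's children and text to limit memory use


sample = io.BytesIO(
    b"<codeBook><dataDscr>"
    b"<var name='V1'><labl>Age</labl></var><var name='V2'/>"
    b"</dataDscr></codeBook>"
)
names = list(iter_var_names(sample))
```

In real use, `source` would be the path of a large DDI file rather than an in-memory sample.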

31 A Survey of UNIX Tools
- UNIX text processing tools
  - sed, awk, tr, cut, head, …
- Pipes
  - Allow the results of one command to be sent to another
- UNIX batch commands
  - ls, grep, xargs
- UNIX scheduling
  - cron

32 Introduction to sed
- Sed performs line-by-line substitutions using regular expressions
sed -f commandsfile sourcefile > destinationfile
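To make the line-by-line model concrete, here is a rough Python analogue of what a sed commands file does: every rule is applied to each line in turn. It does not parse sed syntax; the rules are plain (pattern, replacement) pairs, and every match on a line is replaced, as with sed's `g` flag. The function name and sample rules are hypothetical.

```python
import re


def substitute_lines(lines, rules):
    """Apply each (pattern, replacement) rule to every line, one line at a time."""
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]
    for line in lines:
        for pattern, repl in compiled:
            line = pattern.sub(repl, line)
        yield line


rules = [
    (r"^(V[0-9]+) +", r"\1|"),  # put a delimiter after the variable name
    (r" +$", ""),               # strip trailing blanks
]
cleaned = list(substitute_lines(["V1   Age  ", "notes"], rules))
```

Because each line is handled separately, a pattern can never match across a line break, which is the same limitation the sed exercise warns about.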

33 Automation Exercise 2
- We’ll use sed to duplicate the functionality of a TextPad macro we created previously
- Open the folder “sed” and follow the instructions in “sed lesson.txt”
- WARNING 1: sed’s regular expressions are slightly different from TextPad’s
- WARNING 2: sed by default processes line-by-line
Sed is available on all UNIX systems. See “README_download_instructions” for Windows machines.

34 Automation: Translating manual steps to automated steps
Sources Review
[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) → TextPad → DDI]
The functionality of TextPad on Windows can be replaced by sed or awk on UNIX.

35 Sources Review
[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) → pdf2text, TextPad/sed, xconvert → DDI]
The functionality of TextPad on Windows can be replaced by sed or awk on UNIX.

36 Automation Exercise 3
- Hooking things together with pipes (or files)
- Open the folder “automate” and follow the instructions in “automate lesson.txt”
- Batch processing with ls, sed, grep, and xargs

37 Advice for Batch Processing
- Use a consistent naming convention
- Identify the driving files
- Schedule using cron
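A consistent naming convention is what lets a batch job derive every output name from its driving file. The Python sketch below assumes, purely for illustration, that the driving files are `.sps` files and that each output shares the base name with an `.xml` extension; the function name and suffixes are hypothetical.

```python
from pathlib import Path


def plan_batch(folder, driving_suffix=".sps", output_suffix=".xml"):
    """List (input, output) pairs for every driving file in a folder."""
    jobs = []
    for src in sorted(Path(folder).glob("*" + driving_suffix)):
        # Same base name, different extension: 06084.sps -> 06084.xml
        jobs.append((src, src.with_suffix(output_suffix)))
    return jobs
```

A script built on this can be scheduled with cron and will pick up whatever driving files are present in the folder at run time.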

38 Sources for Automation
- Not every process is suited for automation
- A process may be partially automated
- Sources which are formatted in a regular manner are ideal for automation
  - Database output
  - Excel spreadsheets
  - Delimited text
  - Machine-generated output

39 Make use of intermediate formats
- Converting to an intermediate regular format that already has scripts/tools written for it can simplify your work.
- Candidates:
  - Delimited text
  - XML
  - HTML
  - Proprietary format (SDA’s DDL, SPSS’s __)

40 Using the Intermediate Format Strategy: Example 1
- GESIS spssoms2ddi is an example of using the intermediate format strategy
[Diagram: SPSS file → (study_oms.spss) → SPSS OMS XML → (spssoms2ddi stylesheet) → DDI]
This is an example of doing it the right way: SPSS outputs proper XML according to a schema.

41 Using the Intermediate Format Strategy: Example 2
- XCONVERT does not output frequencies
- The SAS ODS command wrapper displays output as (badly formed) html tables
[Diagram labels: SAS, ODS, HTML frequencies, HTML tidy, xhtml, xslt, DDI; xslt, delimited text, sqlldr, Oracle]

42 SAS ODS
- SAS ODS is able to output its results as html instead of a .lst or .rtf file
- Just wrap your run statement:
    ODS html file="result.htm";
    your sas code …
    proc print data=new; run;
    ODS html close;

43 SAS ODS HTML output
- Bad html: verbose, mismatched nesting
- Show example
- XSLT cannot be applied directly to this output
- Use HTML tidy (open source) to clean this bad html before applying XSLT stylesheets
- tidy options sourcefile > resultfile
- HTML tidy is built into Apache Cocoon

44 Automation Exercise 4
- HTML Tidy allows you to deal with the badly formed xml/html that naturally occurs in the real world
- Open the folder “tidy” and follow the instructions in “tidy lesson.txt”

45 Sources
[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) → pdf2text, sed, xconvert, oraxsl + stylesheet, ODS, HTML tidy → DDI]

46 Database sources
- Use intermediate formats such as xml or html
- Some databases can output directly to “xml” or “html”, but delimited text is fine
- Usually, the “xml” output needs to be cleaned by HTML tidy
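Delimited text exported from a database is easy to turn into DDI-style markup with a short script. The Python sketch below assumes a tab-delimited export with the hypothetical column headers `name` and `label`; it emits `var` and `labl` elements inside a `dataDscr` fragment rather than a complete, validated DDI file.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr


def delimited_to_vars(text, delimiter="\t"):
    """Convert rows of (name, label) delimited text into <var> elements."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    items = []
    for row in reader:
        items.append(
            "  <var name=%s><labl>%s</labl></var>"
            % (quoteattr(row["name"]), escape(row["label"]))
        )
    return "<dataDscr>\n" + "\n".join(items) + "\n</dataDscr>"
```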

47 Excel as an editing/automation tool
- Excel can read/write delimited text
- Excel can read html
- Excel has macros
- Excel rowset demo/exercise

48 Sources
[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) → pdf2text, sed, xconvert, oraxsl + stylesheet, ODS, HTML tidy → DDI]

49 Sources & Destinations
[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) → DDI → XSLT → destinations (spss, sas, stata; pdf; text codebook; XML; html; database; Excel; delimited text; osiris, marc, …)]

50 DDI to MARC
- Sometimes, XSLT will only get you 99% of the way
- MARC output requires control characters which are illegal in XML/XSLT
- Strategy 1: output substitute characters, then use tr or sed to replace them with the control characters
For a single study:
    oraxsl 06084.xml 00.xsl temp1.xml
    oraxsl temp1.xml 00.xsl temp2.xml
    oraxsl temp2.xml 00.xsl temp3.xml
    oraxsl temp3.xml 00.xsl temp4.txt
    sed -f restoreIllChars.sed temp4.txt > 06084.marc
As a script that takes the study number as its first argument:
    oraxsl $1.xml 00.xsl temp1.xml
    oraxsl temp1.xml 00.xsl temp2.xml
    oraxsl temp2.xml 00.xsl temp3.xml
    oraxsl temp3.xml 00.xsl temp4.txt
    sed -f restoreIllChars.sed temp4.txt > $1.marc
    rm -f temp?.xml temp4.txt
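The substitute-character strategy can be sketched in Python as a single translation step. MARC 21 uses the control characters 0x1F (subfield delimiter), 0x1E (field terminator) and 0x1D (record terminator); the three printable stand-ins below are hypothetical and are not the ones used in restoreIllChars.sed, whose contents are not shown in the slides.

```python
# Hypothetical printable stand-ins emitted by the stylesheet, mapped to MARC control characters.
RESTORE = str.maketrans({
    "\u2021": "\x1f",  # double dagger -> subfield delimiter
    "\u00b6": "\x1e",  # pilcrow       -> field terminator
    "\u00a7": "\x1d",  # section sign  -> record terminator
})


def restore_control_chars(text):
    """Replace the stand-in characters with the real MARC control characters."""
    return text.translate(RESTORE)
```

Whatever stand-ins are chosen must never occur in the actual catalog data, otherwise they would be converted as well.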

51 DDI to MARC
- Revised strategy: after working with MARC for a while, we decided that we could make use of existing utilities
  1. Convert DDI to marcxml (with an XSLT stylesheet written at ICPSR) using oraxsl
  2. Convert marcxml to marc21 using marc4j
- Marc4j and other MARC utilities are available at http://www.loc.gov/marc/marctools.html

52 Contact info
- Sanda Ionescu: sandai@icpsr.umich.edu
- I-Lin Kuo (until Aug 18): ikuo@icpsr.umich.edu

