
1 Distribution A: Approved for public release; distribution is unlimited. Case Number: 88ABW-2015-1636, 31 Mar 2015 A Tool that Uses the SAS PRX Functions to Fix Delimited Text Files By: Paul Genovesi

2 A Tool that Uses the SAS® PRX Functions to Fix Delimited Text Files
Paul Genovesi
Henry Jackson Foundation for the Advancement of Military Medicine, Inc.

Abstract
Delimited text files are often plagued by appended and/or truncated records. The file_fixing_tool can fix these files so they can be imported into SAS.

Objectives
Delimited text file structure (both normal and broken)
Truncated-only method
Appended method using first field and last field patterns
Text qualifying
Common surrounding characters (CSCs)
Before and after examples of fixed delimited text files

Delimited Text File Structure
Four structure types (1 normal and 3 broken), where [-----] = one record and each line break = the record separator.

Normal Records
[-----]

Truncated Records
[----
-]
[-----]
[---
--]
[-----]
Note: Truncating occurs within records 1 and 3.

Appended Records
[-----][-----]
[-----][-----][-----]
[-----]
[-----][-----]
Note: Appending occurs after records 1, 3, 4, 7.

Appended & Truncated
[-----][----
-]
[-----]
[---
--][-----][-----]
[---
--]
Note: Appending occurs after records 1, 4, 5. Truncating occurs within records 2, 4, 7.

Truncated-Only Method
For use on broken delimited text files containing truncated records but no appended records.
Within your broken text file, the first field of every record and its following field delimiter must occur on the same line (i.e., a record separator can’t occur between them).
Does not use your last field and first field patterns.
Uses a built-in pattern and delimiter counting.

Appended Method
Fixing appended records (with or without truncated records) is more difficult than fixing truncated records alone. The key is developing last field and first field patterns that identify and isolate either the last field or the first field from all other fields. The other fields are easily identified and isolated by their surrounding field delimiters, but there is no field delimiter separating one record’s last field from the following record’s first field.

Last Field, First Field Pattern Examples
Ex. #1: SSN
Contents: 9-digit string
Pattern: \d{9}
Ex. #2: SSN, blank fields
Contents: 9-digit string, or blank
Pattern: (?:\d{9}| *)
Ex. #3: Categorical Data, case-insensitive
Contents: red, white, blue, RED, WhiTe, bLUe, or blank
Pattern: (?i:red|white|blue| *)
Ex. #4: Number between 1 and 1000000
Contents: 1 to 7 digit character string, or blank
Pattern: (?:\d{1,7}| *)
Ex. #5: Date
Contents: Date string (with format MM/DD/YYYY), or blank
Pattern: (?:\d{2}\/\d{2}\/\d{4}| *)

Common Surrounding Characters
CSCs are characters that are contained within a field (i.e., cell) and occur on the field’s outer edges, but still inside any existing text qualifiers.
The following CSCs can occur within text-qualified fields:
1. The field delimiter
2. Text-qualifier character pairs (i.e., a consecutive EVEN number of them, for example "", """", …)
3. The other (i.e., not being used) text qualifier character (for example, if " is being used, then the other is ' and vice versa)
4. The space character
The following CSCs can occur within non-text-qualified fields:
1. The double quote character
2. The single quote character
3. The space character
The file_fixing_tool automatically matches a continuous string of CSCs occurring on a field’s outer edges. Why are they automatically matched?
Reason #1: If they weren’t, then your last field and first field patterns would have to account for their existence.
Reason #2: Accounting for the existence of double or single quotes in a pattern contained in a macro variable can be tricky in terms of unmatched quote errors. It’s safer to let the file_fixing_tool take care of it.
There are no known side effects to doing it this way.

Text Qualifying
Text qualifying occurs when a cell’s contents within a delimited text file are enclosed in double quotes or single quotes.
Double quotes and single quotes cannot both be used as text qualifiers within the same file.
A cell’s contents MUST be text qualified when (1) they contain the field delimiter or (2) the text qualifier being used also occurs within these contents. In the second case, each text qualifier character inside the contents must be escaped with another text qualifier character; in other words, two mean one.

Conclusions
Use the appended method if you are able to develop last field and first field patterns that identify and isolate the last and first fields.
If you’re not able to do this, then you can still use the truncated-only method as long as your broken file contains only truncated records.

Acknowledgments
In memory of Jan Abshire, who gave me an assignment dealing with exactly this issue. A thank you to SAS Institute’s Adam Pilz, who along with Jan gave me the idea for this paper.

Disclaimer
The views expressed are those of the author and do not necessarily reflect the official policy or position of the Air Force, the Department of Defense, or the U.S. Government.

3 Before and After Example #1: A broken delimited text file containing truncated but no appended records
Upper Left: A broken delimited text file containing truncated but no appended records.
Lower Left: The same delimited text file after it was fixed using the truncated-only method.
Upper Right: The SAS dataset created by importing the fixed delimited text file into SAS Enterprise Guide (EG).
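The truncated-only method shown in this example relies on delimiter counting. The Python sketch below illustrates that idea only; it is not the file_fixing_tool's SAS code. It assumes that the number of fields per record is known, that no field contains the delimiter (text qualifiers are ignored), and that each record's first field and its following delimiter sit on the same line, as the poster requires. The function name `fix_truncated` and the sample data are hypothetical.

```python
def fix_truncated(lines, n_fields, delimiter=","):
    """Rejoin records that were split across physical lines.

    Assumes every complete record holds exactly n_fields - 1 delimiters
    and that no field contains the delimiter itself.
    """
    expected = n_fields - 1      # delimiters in one complete record
    fixed = []
    buffer = None                # record currently being assembled
    for line in lines:
        line = line.rstrip("\r\n")
        if buffer is None:
            buffer = line
        elif buffer.count(delimiter) < expected or delimiter not in line:
            # Either the record is still short of delimiters, or this line
            # has none and so can only continue the previous last field.
            buffer += line
        else:
            fixed.append(buffer)  # previous record is complete
            buffer = line         # this line starts a new record
    if buffer is not None:
        fixed.append(buffer)
    return fixed


# Hypothetical 3-field file in which records 2 and 3 were truncated.
broken = ["id,name,city", "1,Ann,Day", "ton", "2,Bob", ",Xenia", "3,Cy,Troy"]
fixed = fix_truncated(broken, n_fields=3)
# fixed == ["id,name,city", "1,Ann,Dayton", "2,Bob,Xenia", "3,Cy,Troy"]
```

A line with no delimiter can only be the tail of the previous record's last field, which is why the same-line requirement on the first field matters in this sketch.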

4 Before and After Example #2: A broken delimited text file containing both truncated and appended records
Upper Left: A broken delimited text file containing both truncated and appended records.
Lower Left: The same delimited text file after it was fixed using the appended method.
Upper Right: The SAS dataset created by importing the fixed delimited text file into SAS EG.
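The role of the last field and first field patterns in the appended method can be sketched in Python as follows. This is an illustration under stated assumptions, not the file_fixing_tool's implementation: it removes every existing record separator and then re-splits the text with one record-level regular expression. It assumes no field contains the delimiter and that there is no header row. The function name `split_appended` and the sample data are hypothetical; the two patterns are the poster's Ex. #1 and Ex. #3.

```python
import re


def split_appended(text, first_pat, last_pat, n_fields, delimiter=","):
    """Re-split delimited text whose record separators are missing or misplaced."""
    d = re.escape(delimiter)
    # Middle fields are bounded by delimiters on both sides.
    middle = f"(?:[^{d}]*{d})" * (n_fields - 2)
    record_re = re.compile(f"{first_pat}{d}{middle}{last_pat}")

    flat = "".join(text.splitlines())   # drop every existing record separator
    records, pos = [], 0
    while pos < len(flat):
        m = record_re.match(flat, pos)
        if m is None or m.end() == pos:
            raise ValueError(f"no record matches at offset {pos}")
        records.append(m.group())
        pos = m.end()
    return records


FIRST_FIELD = r"\d{9}"                    # poster Ex. #1: SSN
LAST_FIELD = r"(?i:red|white|blue| *)"    # poster Ex. #3: categorical or blank

# Record 2 is appended to record 1 and also truncated after "Bo".
broken = "123456789,Ann,red987654321,Bo\nb,WHITE\n555555555,Cy,blue\n"
fixed = split_appended(broken, FIRST_FIELD, LAST_FIELD, n_fields=3)
# fixed == ["123456789,Ann,red", "987654321,Bob,WHITE", "555555555,Cy,blue"]
```

Only the boundary between a last field and the next first field lacks a delimiter, so those are the only two fields that need a content pattern in this sketch.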

5 Before and After Example #3: A delimited text file in an “already-fixed” state, meaning it contains no truncated or appended records
Upper Left: A delimited text file in an “already-fixed” state, meaning it contains no truncated or appended records.
Lower Left: The same delimited text file after using the appended method. Notice that the “already-fixed” state has been maintained except for a few double quotes that get matched to one record’s last field instead of the following record’s first field. This transferring of double quotes is unavoidable since they could logically belong to either field.
Upper Right: The SAS dataset created by importing the delimited text file (pictured in lower left) into SAS EG.
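The ambiguity described for the transferred double quotes can be illustrated with a small Python sketch. It assumes, purely for illustration, that runs of quote and space characters on a field's edge are matched by a regular expression; the slides do not show the tool's actual expression. The names `CSC`, `LAST`, and `FIRST` and the sample string are hypothetical.

```python
import re

CSC = r"""["' ]*"""              # run of quote/space characters on a field's edge
LAST = r"(?i:red|white|blue)"    # last field of one record
FIRST = r"\d{9}"                 # first field of the following record

joined = 'red""123456789'        # two adjacent fields with no record separator

# Greedy: the quotes are attached to the last field.
greedy = re.fullmatch(f"({LAST}{CSC})({CSC}{FIRST})", joined)
# Lazy: the same quotes are attached to the following first field.
lazy = re.fullmatch(f"({LAST}{CSC}?)({CSC}{FIRST})", joined)

print(greedy.groups())   # ('red""', '123456789')
print(lazy.groups())     # ('red', '""123456789')
```

Both splits consume the whole string, so nothing in the text itself says which field owns the quotes.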


