PhUSE 2011: Brighton TS09 Rectifying Irregular Text Data a Case for Using Regular Expressions in SAS Jayshree Garade Manjusha Gode.


1 PhUSE 2011: Brighton. TS09: Rectifying Irregular Text Data, a Case for Using Regular Expressions in SAS. Jayshree Garade, Manjusha Gode

2 Outline
- Problems
- Solutions & Introducing Regular Expressions
- Advantages over SAS String Functions
- Points to note while using Regular Expressions
- References


4 Problem: Physical abnormalities
SUBJID  TRT  ABNORMALITY
01-011  B    ANEMIA
01-036  D    ANAEMIA
01-026  C    ANEMEA
01-014  B    ANEMIC

5 Problem: Time point variable …
USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN
1        1      17-Oct-08  Per 1 D01 Predose  47       /min
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min
1        2      3-Nov-08   Per 1d01 02hr      49       /min
1        3      4-Nov-08   day1               53       /min
1        9      03-Feb-09  Poststudy          56       /min


7 …Problem: Time point variable
USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN  Time_desc
1        1      17-Oct-08  Per 1 D01 Predose  47       /min     Predose
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min     Day 1, 0.5 Hour
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min     Day 1, 1 Hour
1        2      3-Nov-08   Per 1d01 02hr      49       /min     Day 1, 2 Hours
1        3      4-Nov-08   day1               53       /min     Day 1
1        9      03-Feb-09  Poststudy          56       /min     Poststudy

8 …Problem: Time point variable
PRSDTLTM: D01, d01, day1
Time_desc: Day 1


10 …Ways to approach the problem
Traditional: using SAS string functions
INDEX, INDEXC, INDEXW, FIND, FINDC, VERIFY, TRANWRD, TRANSLATE, SUBSTR, SCAN, SCANQ, CALL SCAN, CALL SCANQ, ANYALNUM, ANYALPHA, ANYDIGIT, ANYPUNCT, ANYSPACE, NOTALNUM, NOTALPHA, NOTDIGIT, NOTUPPER, CALL CATS, CALL CATT, CALL CATX, COMPARE, COMPLEV, COMPGED, CALL COMPCOST, SOUNDEX, SPEDIS, MISSING, RANK, REPEAT, REVERSE …

11 Alternative approach to the problem: introducing REGULAR EXPRESSIONS!

12 Introduction – Regular Expressions
- A powerful technique for searching and manipulating text data.
- A mini programming language for pattern matching.
- Two types of pattern-matching functions in SAS:
  - SAS Regular Expressions – SAS Version 6.12
  - PERL Regular Expressions – SAS Version 9

13 Steps to use Regular Expressions…
Problem -> Required Portion -> Pattern -> Regular Expressions -> Locate Reqd. Portion -> Process Data

14 Step 1 – Identify the problem …
USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN  time_desc
1        1      17-Oct-08  Per 1 D01 Predose  47       /min     Predose
1        2      3-Nov-08   Per 1 D01.5 hr     58       /min     Day 1, 0.5 Hour
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min     Day 1, 1 Hour
1        2      3-Nov-08   Per 1d01 02 hr     49       /min     Day 1, 2 Hours
1        3      4-Nov-08   Day1               53       /min     Day 1
1        9      03-Feb-09  Poststudy          56       /min     Poststudy

15 Step 2 – Visualize the Required Portion within the source text
PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy
Required portion: D01; d01; D 01; Day1

16 Step 3 – Identify a pattern
PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy
Pattern to EXTRACT, in order: preceding blank; D or d; 2 non-digits; following blank; one or more digits; following blank

17 Step 3 – Identify a pattern
PRSDTLTM values: Prestudy; Per 1 D01 Predose; D2; Per 1d01 02 hr; Per 1 D 01 01 hr 30 min; Poststudy
Pattern to EXTRACT, in order: preceding blank; D or d; following blank; one or more digits; following blank

18 Step 3 – Identify a pattern
PRSDTLTM values: Per 1 D01 Predose; Per 1 D01; Per 1 D 01 01 hr 30 min; Per 1d01 02 hr; Day2; Poststudy
Pattern to EXTRACT, in order: preceding blank; D or d; following blank; one or more digits; following blank

19 Regular Expressions Syntax...at a glance
Metacharacter  Description
*              Matches the previous subexpression zero or more times
+              Matches the previous subexpression one or more times
?              Matches the previous subexpression zero or one time
\d             Matches a digit (0-9)
\D             Matches a non-digit
\w             Matches a word character (upper or lower case letter, digit, or underscore)
[abc]          Matches any one of the characters in the brackets
\(             Matches a literal (
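
As a rough illustration of the metacharacters above, the following DATA step sketch calls PRXMATCH on a few short strings. The test strings and the variable names (p_digit and so on) are made up for this example and do not come from the slides. PRXMATCH returns the position of the first match, or 0 when nothing matches.

```sas
/* Illustrative only: the test strings and variable names are assumptions. */
data _null_;
  p_digit    = prxmatch('/\d/',     'hr 02');      /* first digit */
  p_nondigit = prxmatch('/\D/',     '01 hr');      /* first non-digit (here a blank) */
  p_plus     = prxmatch('/\d+ hr/', 'D01 02 hr');  /* one or more digits followed by " hr" */
  p_optional = prxmatch('/days?/',  'day1');       /* the trailing "s" is optional */
  p_class    = prxmatch('/[Dd]ay/', 'Per 1 day1'); /* either D or d */
  p_paren    = prxmatch('/\(/',     'Day (1)');    /* an escaped, literal "(" */
  put p_digit= p_nondigit= p_plus= p_optional= p_class= p_paren=;
run;
```

The PUT statement writes the six positions to the SAS log, which makes it easy to try out a metacharacter on a known string before using it on real data.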

20 Step 4 – Write the Regular Expression for the pattern
PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy
Preceding blank: " ?" gives ("/ ?/")

21 Step 4 – Write the Regular Expression for the pattern
D or d: "[Dd]" gives ("/ ?[Dd]/")

22 Step 4 – Write the Regular Expression for the pattern
2 non-digits: "(\D\D)?" gives ("/ ?[Dd](\D\D)?/")

23 Step 4 – Write the Regular Expression for the pattern
Following blank: " ?" gives ("/ ?[Dd](\D\D)? ?/")

24 Step 4 – Write the Regular Expression for the pattern
One or more digits: "\d+" gives ("/ ?[Dd](\D\D)? ?\d+/")

25 Step 4 – Write the Regular Expression for the pattern
Following blank: " +" gives ("/ ?[Dd](\D\D)? ?\d+ +/")

26 Step 4 – Write the Regular Expression for the pattern
Regular expression: ("/ ?[Dd](\D\D)? ?\d+ +/")
PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy

27 Step 4 – Write the Regular Expression for the pattern
    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
    run;
Callout on the pattern string: Metacharacters

28 Recap… Steps to use Regular Expressions…
Problem -> Required Portion -> Pattern -> Regular Expressions -> Locate Reqd. Portion -> Process Data


32 Step 5 – Locate the Required Portion
    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp day_nexp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
      * Locating the day text pattern in the PRSDTLTM var;
      CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
    run;
CALL PRXSUBSTR arguments: day_exp = pattern definition; PRSDTLTM = source variable; dayst = stores the start position of the matched string; dayln = stores the length of the matched string.

33 Step 6 – Use other SAS text functions to further process data
    /* Extracting the Day Text portion */
    data day_txt;
      set lb.ecg(keep = PRSDTLTM);
      retain day_exp day_nexp;
      if _n_ = 1 then do;
        * defined to describe the day text pattern;
        day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
      end;
      * Locating the day text pattern in the PRSDTLTM var;
      CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
      * Extracting the day text pattern;
      day_txt = substrn(PRSDTLTM, dayst, dayln);
    run;
SUBSTRN arguments: PRSDTLTM = source variable; dayst = starting position; dayln = length of matched pattern.

34 …Output
PRSDTLTM values: Per 1 D01 Predose; Per 1 D01.5 hr; Per 1 D 01 01 hr; Per 1d01 02 hr; Day1; Poststudy
Extracted strings (day_txt): D01; Day1; d01; D 01
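
The steps above can be tried without access to lb.ecg by reading a few sample values from DATALINES. The sketch below is one possible way to do that and to take the extracted text one step further, towards the "Day 1" part of Time_desc. The data set name timepoints and the variables day_num and time_day are hypothetical, the sample values are adapted from the slides, and the hour portion of Time_desc is not handled here.

```sas
/* Sample values adapted from the slides; TIMEPOINTS, DAY_NUM and TIME_DAY
   are hypothetical names used only for this sketch. */
data timepoints;
  infile datalines truncover;
  input PRSDTLTM $char20.;
  datalines;
Per 1 D01 Predose
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
;
run;

data day_txt;
  set timepoints;
  length day_txt time_day $20;
  retain day_exp;
  /* Compile the pattern once, on the first iteration. */
  if _n_ = 1 then day_exp = prxparse("/ ?[Dd](\D\D)? ?\d+ +/");

  /* dayst is 0 when the pattern is not found. */
  call prxsubstr(day_exp, PRSDTLTM, dayst, dayln);
  if dayst > 0 then do;
    day_txt  = strip(substrn(PRSDTLTM, dayst, dayln));
    /* Keep only the digits of the matched text and convert them to a number. */
    day_num  = input(compress(day_txt, , 'kd'), 8.);
    time_day = catx(' ', 'Day', day_num);
  end;
  drop day_exp dayst dayln;
run;
```

With these sample rows, the first four values should end up with time_day set to Day 1, while Poststudy has no day portion and is left blank. The match for Day1 relies on the trailing blanks that pad a fixed-length character variable, since the pattern requires at least one blank after the digits.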


36 Advantages…
- Compact solution
- Tremendous flexibility
- Concise description of patterns
- Handles highly unstructured data streams
- Multiple matching patterns in one step
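
To illustrate "multiple matching patterns in one step", here is a minimal sketch applied to the physical abnormality problem from the start of the talk. The slides do not show a solution for that variable, so the pattern, the data set names and the output variable ABN_STD are all assumptions made for this example.

```sas
/* Hypothetical sketch: the pattern and ABN_STD are not from the slides. */
data abn;
  input SUBJID $ TRT $ ABNORMALITY :$20.;
  datalines;
01-011 B ANEMIA
01-036 D ANAEMIA
01-026 C ANEMEA
01-014 B ANEMIC
;
run;

data abn_std;
  set abn;
  length ABN_STD $20;
  /* One pattern covers ANEMIA, ANAEMIA, ANEMEA and ANEMIC.
     /i ignores case; \s* allows for the trailing blanks of the variable. */
  if prxmatch('/^ANA?EM(IA|EA|IC)\s*$/i', ABNORMALITY) then ABN_STD = 'ANEMIA';
  else ABN_STD = ABNORMALITY;
run;
```

A version built on string functions would typically need one comparison per spelling, whereas here new variants can be added inside the alternation without changing the rest of the step.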


38 Look before you leap
- Document the regular expression thoroughly.
- Understand the patterns in the data.
- Define the regular expression before use.
- Define it only once in a data step.
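
The last two points can be sketched as follows: the pattern is compiled on the first iteration only, its id is retained, and the step stops early if compilation fails. The data set name check and the flag has_day are hypothetical names chosen for this example.

```sas
/* Sketch only: CHECK and HAS_DAY are hypothetical names. */
data check;
  infile datalines truncover;
  input PRSDTLTM $char20.;
  retain day_exp;                  /* keep the pattern id across iterations */
  if _n_ = 1 then do;              /* define the pattern once, before first use */
    day_exp = prxparse("/ ?[Dd](\D\D)? ?\d+ +/");
    if missing(day_exp) then do;   /* PRXPARSE returns missing for an invalid pattern */
      put "ERROR: the day pattern could not be compiled.";
      stop;
    end;
  end;
  /* 1 if a day portion is present, 0 otherwise */
  has_day = (prxmatch(day_exp, PRSDTLTM) > 0);
  drop day_exp;
  datalines;
Per 1 D01 Predose
Poststudy
;
run;
```

Compiling inside the `_n_ = 1` block avoids calling PRXPARSE again for every observation, and the comment next to the pattern is a natural place for the documentation the slide asks for.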


40 …References
- support.sas.com
- Richard F. Pless, Ovation Research Group, Highland Park, IL. An Introduction to Regular Expressions with Examples from Clinical Data. Paper TU02.
- Ron Cody, Robert Wood Johnson Medical School, Piscataway, NJ. An Introduction to Perl Regular Expressions in SAS 9. SUGI 29 Tutorials, Paper 265-29.
- James J. Van Campen, SRI International, Menlo Park, CA. An Introduction to PERL Regular Expression in SAS®.

41 Q & A
Contact: jayshree.garade@cytel.com, manjusha.gode@cytel.com

42 Thank you

