Copyright © 2013 Splunk Inc. Regex is Fun David Clawson SplunkYoda.

1 Copyright © 2013 Splunk Inc. Regex is Fun David Clawson SplunkYoda

3 Regular Expressions
“A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings.” (Ken Thompson)
QED Text Editor, written by Ken in the 1970s
Invented in the 1940s
Help celebrate its 70th year

4 Types of Regular Expressions

5 How is Regex used in Python?
Python "re"
Python's built-in "re" module provides excellent support for regular expressions, with a modern and complete regex flavor. At the time of this talk (2013), the only significant features missing from Python's regex syntax were atomic grouping, possessive quantifiers, and Unicode properties; Python 3.11 later added atomic grouping and possessive quantifiers.
Using Regular Expressions in Python
The first thing to do is to import the regex module into your script with "import re".

6 How is Regex used in Python? Call re.search(regex, subject) to apply a regex pattern to a subject string. The function returns None if the matching attempt fails, and a Match object otherwise. The Match object stores details about the part of the string matched by the regular expression pattern. Since None evaluates to False, you can easily use re.search() in an if statement.
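A minimal sketch of the re.search() idiom described on this slide. The log line and the pattern are invented for illustration and are not from the talk.

```python
import re

# Hypothetical log line, made up for this example.
subject = "ERROR 2012-12-06 disk /dev/sda1 is 91% full"

match = re.search(r"\d+%", subject)
if match:
    # group() is the matched text; start()/end() are its offsets in subject.
    found = (match.group(), match.start(), match.end())
else:
    found = None

print(found)
```

Because a failed search returns None, the same if statement covers both outcomes without any extra checks.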

7 How is Regex used in Python? Do not confuse re.search() with re.match(). Both functions do the same job, with one important distinction: re.search() attempts the pattern throughout the string until it finds a match, whereas re.match() only attempts the pattern at the very start of the string.
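The difference between the two calls can be sketched as follows, using a made-up key=value string (the sample text and variable names are assumptions for illustration only).

```python
import re

subject = "user=alice action=login"  # hypothetical sample text

anywhere = re.search(r"action=\w+", subject)  # scans the whole string
at_start = re.match(r"action=\w+", subject)   # only tries position 0, so it fails
prefix = re.match(r"user=\w+", subject)       # succeeds: the pattern fits at position 0

print(anywhere.group(), at_start, prefix.group())
```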

8 How is Regex used in Python? To get all matches from a string, call re.findall(regex, subject). This returns a list of all non-overlapping regex matches in the string. "Non-overlapping" means that the string is searched from left to right, and the next match attempt starts beyond the previous match. If the regex contains one capturing group, re.findall() returns a list of the text matched by that group; with two or more capturing groups, it returns a list of tuples, each tuple containing the text matched by all the capturing groups. The overall regex match is not included in the tuple unless you place the entire regex inside a capturing group.
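A short sketch of how the return shape of re.findall() depends on the number of capturing groups. The address:port string is invented for the example.

```python
import re

subject = "10.0.0.1:80 10.0.0.2:443"  # hypothetical sample text

# No groups: a list of whole matches.
no_groups = re.findall(r"\d+\.\d+\.\d+\.\d+", subject)

# One group: a list of strings, one per match, holding that group's text.
one_group = re.findall(r":(\d+)", subject)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+\.\d+\.\d+\.\d+):(\d+)", subject)

print(no_groups, one_group, two_groups)
```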

9 How is Regex used in Python? Often more efficient than re.findall() is re.finditer(regex, subject), because it produces matches one at a time instead of building the whole list up front. It returns an iterator that enables you to loop over the regex matches in the subject string: for m in re.finditer(regex, subject). The for-loop variable m is a Match object with the details of the current match.
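The loop from this slide might look like the sketch below. The request string is hypothetical; each m is a Match object, so its groups can be read individually.

```python
import re

subject = "GET /a 200, GET /b 404, GET /c 200"  # hypothetical sample text

statuses = []
for m in re.finditer(r"GET (\S+) (\d{3})", subject):
    # group(1) is the path, group(2) the three-digit status code.
    statuses.append((m.group(1), int(m.group(2))))

print(statuses)
```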

10 How is Regex used in Splunk?
Field extraction: | rex field=_raw “%UC_CALLMANAGER-(?<…>\d+)-EndPointUnregistered:
Configure line breaking: LINE_BREAKER = ([\r\n]+)
Filtering and routing data to queues: REGEX = (?m)^EventCode=(592|593)
Many more...

11 Regex Testing Tools: RegExr, Reggy, RegexPal, RegexBuddy, Lars Olav Torvik, Rubular

12 Regex Reference Texts
From the creators of RegexBuddy
Introducing Regular Expressions by Michael Fitzgerald
Mastering Regular Expressions by Jeffrey Friedl
Regular Expressions Cookbook by Jan Goyvaerts
Regular Expressions Pocket Reference by Tony Stubblebine

13 Basic Concepts of Regular Expressions Because Knowing leads to Doing

15 Simple Pattern Matching
Matching String Literals
Matching Digits and Non-Digits
Matching Word and Non-Word Characters
Matching Whitespace
Matching Any Character

16 Matching String Literals
Sample Apache Log: [06/Dec/2012:14:39: ] "GET /Facelift/answers/swelling HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Literal string match of the first IP address would be:

17 Matching Digits and Non-Digits
\d or \D or [0-9]
\d - match a digit
\D - match a non-digit (any character that is not a digit, including letters, whitespace and punctuation)
[0-9] - match any single digit (called a character class)
[^0-9] - match any single non-digit
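A quick illustration of the digit shorthands in Python, using an invented sentence. Note that \D also removes letters here, since it matches every character that is not a digit.

```python
import re

subject = "Order 66 shipped in 3 days"  # hypothetical sample text

single_digits = re.findall(r"\d", subject)    # one list entry per digit
digit_runs = re.findall(r"[0-9]+", subject)   # consecutive digits grouped together
digits_only = re.sub(r"\D", "", subject)      # delete every non-digit character

print(single_digits, digit_runs, digits_only)
```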

18 Matching Words and Non-Words
\w or \W
\w - match any word character; essentially the same as the character class [a-zA-Z0-9_]
\W - match any non-word character
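A small sketch of \w and \W on a made-up key=value string. It also shows that the underscore counts as a word character, so user_id stays in one piece.

```python
import re

subject = "user_id=42;name=Al"  # hypothetical sample text

words = re.findall(r"\w+", subject)      # runs of word characters
separators = re.findall(r"\W", subject)  # every single non-word character

print(words, separators)
```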

19 Matching Whitespace
\s or \S
\s - match whitespace (spaces, tabs, line feeds and carriage returns)
\S - match any character that is not whitespace; same as [^\s]
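One way to see \s and \S at work is to split a string on mixed whitespace. The input is invented and deliberately mixes a tab, double spaces and a CRLF pair.

```python
import re

subject = "alpha\tbeta  gamma\r\ndelta"  # hypothetical sample text

fields = re.split(r"\s+", subject)    # split on any run of whitespace
tokens = re.findall(r"\S+", subject)  # equivalently, collect the non-whitespace runs

print(fields, tokens)
```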

20 Character shorthands for whitespace
Character Shorthand   Description
\f                    Form feed
\h                    Horizontal whitespace
\H                    Not horizontal whitespace
\n                    Newline
\r                    Carriage return
\t                    Horizontal tab
\v                    Vertical tab (whitespace)
\V                    Not vertical whitespace

21 Matching Any Character
Dot (.) - matches any character except line-ending characters
\b - matches a word boundary without consuming any characters
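A brief sketch of both tokens from this slide, using invented strings. The dot skips the newline, and \b only succeeds where a word starts or ends.

```python
import re

# The dot matches "a" and "u" but not the newline in "c\nt".
dot_matches = re.findall(r"c.t", "cat c\nt cut")

subject = "cat concat\ncatalog"  # hypothetical sample text
whole_word = re.findall(r"\bcat\b", subject)  # only the standalone word
word_start = re.findall(r"\bcat", subject)    # "cat" and the start of "catalog"

print(dot_matches, whole_word, word_start)
```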

22 Boundaries and Alternation
Matching the Beginning and End of Line
List of Regex Special Characters
Alternation and Regex Options
Subpatterns
Capturing and Named Groups
Character Classes
Negated Character Classes

23 Matching Beginning and End of Line
^ OR $
^ - matches the beginning of a line
$ - matches the end of a line
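In Python, ^ and $ anchor to the whole string by default and only become per-line anchors with re.MULTILINE. The sketch below uses an invented two-line string to show the difference.

```python
import re

subject = "first line\nsecond line"  # hypothetical sample text

start_default = re.findall(r"^\w+", subject)              # start of the string only
start_multi = re.findall(r"^\w+", subject, re.MULTILINE)  # start of every line
end_default = re.findall(r"\w+$", subject)                # end of the string only
end_multi = re.findall(r"\w+$", subject, re.MULTILINE)    # end of every line

print(start_default, start_multi, end_default, end_multi)
```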

24 List of Regex Special Characters
. ^ * + ? | ( ) { } [ ] \ -
. - matches any character
^ - matches the beginning of the line
* - matches zero or more
+ - matches one or more
? - matches zero or one
| - used for alternation (choice of patterns to match)
() - used for grouping
{} - used as a quantifier
[] - used with character classes
\ - used to make a character literal or to introduce a special regex sequence
- - hyphen is used in a character class range

25 Alternation and Options
| OR ?
| - gives a choice of alternate patterns to match, e.g. (THE|The|the)
(?i) - case insensitive
(?J) - allow duplicate names
(?m) - multiline: ^ and $ match at the start and end of every line
(?s) - single line: dot also matches line endings
(?U) - match lazy by default
(?x) - ignore whitespace and comments in the pattern
(?-…) - unset or turn off options
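A sketch of three of these options in Python. It uses only (?i), (?s) and (?x), which Python's re module shares with PCRE; (?J) and (?U) are PCRE options that Python's re does not accept. The sample strings and the phone-number pattern are invented.

```python
import re

# (?i): ignore case.
any_case = re.findall(r"(?i)the", "THE The the")

# (?s): let the dot match a newline as well.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"(?s)a.b", "a\nb")

# (?x): whitespace and # comments inside the pattern are ignored.
verbose = re.search(r"""(?x)
    (\d{3})   # first three digits
    -
    (\d{4})   # last four digits
""", "call 555-1234")

print(any_case, without_s, with_s.group(), verbose.groups())
```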

26 Subpatterns
Group(s) within a group
(THE|The|the) - has three subpatterns
(t|T)h(e|eir) - matches the, The, their, Their
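A sketch of the second example, assuming the intended pattern is the alternation form (t|T)h(e|eir). The \b anchors and the test sentence are additions for illustration, so that only whole words are reported.

```python
import re

pattern = re.compile(r"\b(t|T)h(e|eir)\b")

sentence = "The cat saw their dog near the tree"  # hypothetical sample text
words = [m.group() for m in pattern.finditer(sentence)]
parts = [m.groups() for m in pattern.finditer(sentence)]  # text captured by each subpattern

print(words, parts)
```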

27 Capturing and Named Groups
() (?<name>…) OR (?P<name>…)
Store their content in memory
(it is) (time to eat) - stored as $1 and $2
(?<Severity>\d) - Splunk creates a field named Severity from this named group
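The same ideas in Python, as a rough sketch. Python's re requires the (?P<name>…) spelling for named groups. The event text below is invented, loosely modelled on the %UC_CALLMANAGER message from the Splunk slide, and the group name Severity follows this slide's note.

```python
import re

# Hypothetical event text, not taken from real logs.
event = "%UC_CALLMANAGER-3-EndPointUnregistered: device SEP001 unregistered"

named = re.search(r"%UC_CALLMANAGER-(?P<Severity>\d+)-EndPointUnregistered", event)
severity = named.group("Severity")

# Numbered groups: what other tools call $1 and $2 is group(1) and group(2) here.
numbered = re.search(r"(it is) (time to eat)", "it is time to eat")

print(severity, numbered.group(1), numbered.group(2))
```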

28 Character Classes
[]
[aeiou] - matches only one of the characters inside the brackets
[0-9] - matches a range of characters, using a hyphen
[a-zA-Z0-9] - matches any alphanumeric character

29 Negated Character Classes
[^…]
*** Super important, especially for Splunk field extractions ***
[^aeiou] - matches any character that is NOT a lowercase vowel (consonants, but also digits, spaces and punctuation)
[^\s] - matches any character that is not whitespace
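A sketch of why negated classes suit field extraction: each capture reads "everything up to the delimiter". The event string and field names are invented for the example.

```python
import re

event = 'src=10.1.1.5 user="alice" action=login'  # hypothetical sample text

src = re.search(r"src=([^\s]+)", event).group(1)     # up to the next whitespace
user = re.search(r'user="([^"]*)"', event).group(1)  # up to the closing quote

# [^aeiou] matches anything that is not a lowercase vowel, not just consonants.
not_vowels = re.findall(r"[^aeiou]", "a1 b!")

print(src, user, not_vowels)
```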

30 Quantifiers
Greedy, Lazy, Possessive
Matching a certain number of times

31 Greedy, Lazy, Possessive
* + ?
* - match zero or more times
.* - will match all of the characters in the subject text (want to avoid this)
+ - match one or more times
\d+ - match all of the digits until there aren't any more (greedy)
? - match 0 or 1 of the preceding token
colou?r - matches either color or colour
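A sketch contrasting greedy and lazy matching on an invented HTML-like string, plus the colou?r example from the slide.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

greedy = re.findall(r"<.*>", subject)  # .* runs to the last ">" in the string
lazy = re.findall(r"<.*?>", subject)   # .*? stops at the first ">" it can
optional = re.findall(r"colou?r", "color colour")

print(greedy, lazy, optional)
```

The greedy form swallowing the whole string is the behaviour the slide warns about.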

32 Matching a Certain Number of Times
{}
\d{3} - matches exactly 3 digits
\d{1,3} - matches a range of 1 to 3 digits
\d{1,} - same as \d+
\d{0,} - same as \d*
\d{0,1} - same as \d?
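The equivalences on this slide can be spot-checked with re.fullmatch(), which requires the pattern to cover the whole string. The helper name agrees and the sample strings are made up for this sketch.

```python
import re

samples = ["", "7", "42", "123", "12345"]  # hypothetical test strings

def agrees(pattern_a, pattern_b):
    """True if both patterns accept exactly the same samples."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
one_to_three = [s for s in samples if re.fullmatch(r"\d{1,3}", s)]

print(exactly_three, one_to_three)
print(agrees(r"\d{1,}", r"\d+"), agrees(r"\d{0,}", r"\d*"), agrees(r"\d{0,1}", r"\d?"))
```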

33 Any Thoughts, Ideas, Feedback?

34 Optimized Regular Expressions Because fast is elegant!

35 Optimize Regular Expressions
Good: (whiskey)
Better: (?:whiskey)
Capture groups add overhead and hurt overall performance; use them only when necessary.
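A sketch of what the (?:…) form changes in Python: the group still groups, but nothing is stored for it. The second word in the pattern is an invented addition so that one real capture remains.

```python
import re

capturing = re.search(r"(whiskey) (sour)", "whiskey sour")
non_capturing = re.search(r"(?:whiskey) (sour)", "whiskey sour")

# Same overall match, but only the capturing form stores "whiskey".
print(capturing.groups(), non_capturing.groups())
```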

36 Optimize Regular Expressions
Good: splunk|splash
Better: spl(?:unk|ash)
Try to "factor" on the left, when you can, while exposing required text. Less alternation is better.
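Factoring should never change what a pattern matches, only how the engine gets there. A minimal sanity check, with invented sample strings, that the two forms accept the same inputs (it does not measure speed):

```python
import re

full = re.compile(r"splunk|splash")
factored = re.compile(r"spl(?:unk|ash)")

samples = ["splunk", "splash", "splat", "my splunk server", "spl"]  # hypothetical inputs

# Both patterns should agree on every sample.
same = all(bool(full.search(s)) == bool(factored.search(s)) for s in samples)
matched = [s for s in samples if factored.search(s)]

print(same, matched)
```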

37 Optimize Regular Expressions
Good: (?:aussie$|gypsie$)
Better: (?:aus|gyp)sie$
Try to "factor" on the right when input text is close to the end of the line. Most regex engines will anchor at end of line when "$" is present.

38 Optimize Regular Expressions
Good: 0{3,7}
Better: 0000{0,4}
Typically, exposing required or literal text makes the engine execute the regex faster.
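The rewritten form spells out the three required zeros and leaves zero to four optional ones, so it should accept the same strings. A small check over runs of 0 to 9 zeros (the loop bounds are arbitrary):

```python
import re

quantified = re.compile(r"0{3,7}")
exposed = re.compile(r"0000{0,4}")  # "000" literal followed by 0{0,4}

# Lengths of all-zero strings that each pattern accepts in full.
lengths_a = [n for n in range(10) if quantified.fullmatch("0" * n)]
lengths_b = [n for n in range(10) if exposed.fullmatch("0" * n)]

print(lengths_a, lengths_b)
```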

39 Optimize Regular Expressions
Good: (.)*
Better: .*
Unnecessary parentheses add overhead. As above, use them only when necessary.

40 Optimize Regular Expressions
Good: matty[:]
Better: matty:
The character class/set (indicated by []) adds unnecessary overhead when it is not needed.

41 Optimize Regular Expressions
Good: ^genti|^collar
Better: ^(?:genti|collar)
Anchoring the regex at the beginning of the line will result in improved performance with most regex engines.

42 Optimize Regular Expressions
Good: delaney$|connery$
Better: (delaney|connery)$
I said, anchor the regex!

43 Optimize Regular Expressions
Good: ^src.*:
Better: ^src[^:]*:
Using a negated character class/set instead of lazy/greedy quantifiers on the dot will typically result in faster regexes. Lazy/greedy dot quantifiers make the regex engine backtrack, which ultimately hurts overall performance.
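Besides speed, the two patterns can match different text when the line holds more than one colon. A sketch with an invented log line:

```python
import re

line = "src=10.0.0.1: dest=10.0.0.2: proto=tcp"  # hypothetical sample text

# Greedy .* grabs the whole line, then backtracks to the LAST colon.
greedy = re.match(r"^src.*:", line).group()

# [^:]* can never cross a colon, so it stops at the FIRST one.
negated = re.match(r"^src[^:]*:", line).group()

print(greedy)
print(negated)
```

For field extraction the negated-class result is usually the intended one.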

44 Optimize Regular Expressions
Good: bride|brian
Better: bri(?:de|an)
Full alternation is more expensive than partial alternation. Also, in this case the regex engine will alternate only AFTER 'bri' has been matched.

45 Optimize Regular Expressions
Good: (?:edu|com|net|…)
Better: (?:com|edu|net|…)
Leading the engine to a match by placing the most popular match first may result in faster execution in some engines.

46 Optimize Regular Expressions
Good: ^.*(answer)
Better: ^.{42}(answer)
Specifying an exact position inside the string and leading the engine to a match will improve performance drastically compared to using a simple greedy/lazy quantifier.

47 Optimize Regular Expressions
Good: .*?a
Better: ^.*a
If 'a' is near the end of the input string, the greedy form will match faster because less backtracking is required.

48 Optimize Regular Expressions
Good: .*a
Better: ^.*?a
If 'a' is near the beginning of the input string, the lazy form will match faster.

49 Optimize Regular Expressions
Good: :[^:]*:
Better: :[^:]*+:
Ex. in ‘ :destination’ the second regex fails faster.

50 Optimize Regular Expressions
Good: :[^:]*:
Better: :(?>[^:]*):
Same as above, using different notation. Explanation: atomic grouping and possessive quantifiers instruct the regex engine not to keep the backtracking states saved by * or +, which prevents it from backtracking to no effect and so lets it fail faster.
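Python's re module only accepts (?>…) and *+ from version 3.11 onward. On older versions, a commonly used emulation (an assumption of this sketch, not something shown in the talk) is a lookahead that captures the text followed by a backreference: the lookahead cannot be re-entered once it has matched, so the captured run is never given back.

```python
import re

# Emulates :(?>[^:]*): with a lookahead plus backreference.
emulated = re.compile(r":(?=([^:]*))\1:")

hit = emulated.search("src :destination: done")  # closing colon present
miss = emulated.search(" :destination")          # no closing colon, so no match

print(hit.group(), miss)
```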

51 Python for the Masses

