Finding the needle(s) in the textual haystack


1 Textual Patterns
Finding the needle(s) in the textual haystack

2 Patterns
From: Gow, Joe
Subject: Reminder About Open Forums Today
Date: March 25, :44:08 AM CDT
Bcc:
Hello, everyone. I just wanted to send a quick reminder about the two campus wide Open Forums we're holding today from 2 to 3 and 3 to 4 p.m. in the Cleary Center. I'll host the first session from 2 to 3, and we'll cover any topics you'd like to discuss. Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation about the plans for a new Cowley Science Building. Please join us!
Thanks, Joe
Joe Gow, Chancellor
University of Wisconsin-La Crosse
Consider the text above. How would you identify …
  … Proper names?
  … addresses?
  … Dates?

3 Patterns What do you think of when you see the following? MM/DD/YYYY
This is a (string) pattern. Are there different patterns for this same thing? How would you describe the pattern of a credit card number?

4 Regular Expressions
Regular expressions are “formulas” for string patterns.
Regular expressions follow a standard notation.
Regular expressions can be used in various computer applications and programming languages.
Applying a regular expression to a string (piece of text) is called pattern matching.
  The regular expression might match the string (or part of it) or it might not.
Notes: Google Code Search supports regular expressions, but ordinary Google search does not. Regular expressions are a notation for patterns of strings. Most single characters are themselves regular expressions. Some characters have a special meaning in a regular expression, so to refer to such a character literally you need to put a backslash (\) in front of it. For example, . means "match any character", while \. matches a literal period.

5 Regular Expression Notation
Regular expressions use a standard pattern language.
Any (non-meta) character is a pattern. The character pattern represents itself.
The '.' (period) is a pattern. The period (a meta character) pattern represents "any character".
If A and B are both patterns, then so are
  AB : This represents the pattern A followed by pattern B
    F. matches Fa, FR and F3 but not fa or aF
  A|B : This represents either the pattern A or the pattern B
    P|Q matches P and Q but not R
Parentheses are special; they form a pattern group. Anything in parentheses is a group. A group is one "thing".
  (red|blue) fish matches what strings?
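The notation above can be tried out directly. The following is a minimal sketch in Python (the slides do not name a language, so Python's re module and the helper name matches are assumptions made for illustration). It uses re.fullmatch, which requires the whole string to fit the pattern.

```python
import re

def matches(pattern, text):
    """Return True when the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Concatenation: the letter F followed by any single character.
concat_hits = [s for s in ["Fa", "FR", "F3", "fa", "aF"] if matches(r"F.", s)]

# Alternation: either P or Q.
alt_hits = [s for s in ["P", "Q", "R"] if matches(r"P|Q", s)]

# Grouping: the alternation applies only inside the parentheses.
group_hits = [s for s in ["red fish", "blue fish", "red", "fish"]
              if matches(r"(red|blue) fish", s)]

print(concat_hits)
print(alt_hits)
print(group_hits)
```

Without the parentheses, red|blue fish would mean "red" or "blue fish", which is why the group matters.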

6 Example
How would you write an expression for the time on a digital 12-hour clock? [HINT: Let’s divide & conquer]
A regular expression matching any possible hour:
  1|2|3|4|5|6|7|8|9|10|11|12
A regular expression matching any possible minute:
  (0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
A regular expression matching any possible time:
  (1|2|3|4|5|6|7|8|9|10|11|12):(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
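The divide-and-conquer step can be mirrored in code by naming each part and joining the parts. This is a sketch in Python; the names HOUR, MINUTE, TIME and is_time are illustrative and not from the slides.

```python
import re

# The two sub-patterns from the slide, kept as separate named pieces.
HOUR = r"(1|2|3|4|5|6|7|8|9|10|11|12)"
MINUTE = r"(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)"

# "Conquer": hour, a literal colon, then minute.
TIME = HOUR + ":" + MINUTE

def is_time(text):
    """Return True when text is a complete 12-hour clock time such as 9:05."""
    return re.fullmatch(TIME, text) is not None

for candidate in ["9:05", "10:00", "12:59", "13:00", "0:30", "7:60"]:
    print(candidate, is_time(candidate))
```

Keeping the hour alternation inside parentheses matters once it is combined with the rest of the pattern; otherwise the | would split the whole expression.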

7 Repeating Patterns within Patterns
Quantifiers are used to allow and constrain repetitions.
If re is a regular expression (pattern), then so are:
  re* represents zero or more repetitions of re
  re+ represents one or more repetitions of re
  re? represents zero or one occurrences of re
  re{n} represents exactly n repetitions of re (n is some positive integer)
  re{m,n} represents at least m and no more than n repetitions of re (n, m are positive integers, m ≤ n)
Write a regular expression for Social Security Numbers
  (0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){2}-(0|1|2|3|4|5|6|7|8|9){4}
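A short Python sketch of the quantifiers, followed by the Social Security Number pattern assembled from a reusable digit group. The variable names and the sample numbers are made up for illustration.

```python
import re

def whole(pattern, text):
    """Return True when the entire string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Each quantifier applied to the one-character pattern "a".
star_empty = whole(r"a*", "")          # zero repetitions are allowed
plus_empty = whole(r"a+", "")          # at least one repetition is required
optional_two = whole(r"a?", "aa")      # at most one occurrence is allowed
range_three = whole(r"a{2,3}", "aaa")  # between two and three repetitions
range_four = whole(r"a{2,3}", "aaaa")  # four is one too many

# The SSN pattern from the slide, built from a single digit group.
DIGIT = r"(0|1|2|3|4|5|6|7|8|9)"
SSN = DIGIT + r"{3}-" + DIGIT + r"{2}-" + DIGIT + r"{4}"

print(star_empty, plus_empty, optional_two, range_three, range_four)
print(whole(SSN, "123-45-6789"), whole(SSN, "123-456-789"))
```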

8 Example
Text: I sometimes wonder if the manufacturers of foolproof items keep a fool or two on their payroll.
Pattern: o{2}.
This pattern matches two o's followed by any one character, for example ool. There are three matches in this example: ool, oof, ool
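The three matches can be reproduced with a search over the whole sentence. This sketch assumes Python's re.findall, which returns every non-overlapping match from left to right.

```python
import re

text = ("I sometimes wonder if the manufacturers of foolproof items "
        "keep a fool or two on their payroll.")

# Two o's followed by any single character (the trailing period is a metacharacter).
found = re.findall(r"o{2}.", text)
print(found)  # ['ool', 'oof', 'ool']
```

The first two matches both come from "foolproof"; the third comes from "fool".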

9 Escaped Characters
Some characters have special meaning in regular expressions, and others have no printable form. Such characters can still be represented using a 2-character notation, known as an escape code.
  \+ represents +
  \. represents .
  The same technique works for * ? ( ) { } [ ] \ ^ $ |
  \n represents the new line character
  \t represents the tab character
  \r represents the carriage return character
  \v represents the vertical tab character
  \f represents the form feed character
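A small Python sketch of the difference an escape makes. The call to re.escape is a Python convenience that adds the backslashes automatically; it is not part of the slides and is shown only as an assumption about how one might avoid escaping by hand.

```python
import re

# Escaped +: a literal plus sign between two 1s.
escaped_plus = re.fullmatch(r"1\+1", "1+1") is not None

# Unescaped +: "one or more 1s, then another 1", so "1+1" does not fit but "111" does.
bare_plus_literal = re.fullmatch(r"1+1", "1+1") is not None
bare_plus_repeat = re.fullmatch(r"1+1", "111") is not None

# \t stands for the tab character, which has no printable form.
tab_count = len(re.findall(r"\t", "a\tb\tc"))

# re.escape builds a pattern that matches the given text literally.
literal = "B.C. (approx.)"
round_trip = re.fullmatch(re.escape(literal), literal) is not None

print(escaped_plus, bare_plus_literal, bare_plus_repeat, tab_count, round_trip)
```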

10 Location Symbols There are also two “location” symbols.
^ matches the start of a line, including right after \n
$ matches the end of a line, including right before \n
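Whether ^ and $ also apply at line breaks inside a string depends on the tool. In Python, which is assumed here for illustration, the line-by-line behaviour described above requires the re.MULTILINE flag; without it, ^ matches only at the very start of the string.

```python
import re

text = "Right now!\nNot right now.\nRight now?"

# With MULTILINE, ^ matches at the start of the string and right after each \n.
line_starts = re.findall(r"^Right", text, flags=re.MULTILINE)

# With MULTILINE, $ matches right before each \n and at the end of the string.
line_ends = re.findall(r"now.$", text, flags=re.MULTILINE)

# Without the flag, only the very start of the string counts for ^.
default_starts = re.findall(r"^Right", text)

print(line_starts)
print(line_ends)
print(default_starts)
```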

11 Sample Regular Expressions
(snow|rain)(flake|drop)
g(rr|ee)*
W.*W
B\.C\.
^Right now.$
^Right now.\$
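One way to explore these samples is to pair each pattern with strings it should and should not accept. The sketch below uses Python; the test strings are invented for illustration and are not from the slides.

```python
import re

# pattern -> (strings that should match fully, strings that should not)
SAMPLES = {
    r"(snow|rain)(flake|drop)": (["snowflake", "raindrop", "snowdrop"], ["snow", "rainflakes"]),
    r"g(rr|ee)*": (["g", "grr", "geerr"], ["gr", "gee!"]),
    r"W.*W": (["WW", "WoW", "W to W"], ["Wow"]),
    r"B\.C\.": (["B.C."], ["BxCx", "BC"]),
    r"^Right now.$": (["Right now!", "Right now."], ["Right now"]),
    r"^Right now.\$": (["Right now!$"], ["Right now!"]),
}

def check(samples):
    """Return True when every accept string matches and every reject string does not."""
    for pattern, (accept, reject) in samples.items():
        if not all(re.fullmatch(pattern, s) for s in accept):
            return False
        if any(re.fullmatch(pattern, s) for s in reject):
            return False
    return True

print(check(SAMPLES))
```

The last two patterns differ only in the final character: an unescaped $ marks the end of the line, while \$ demands a literal dollar sign.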

12 Character Classes
Square brackets enclose a character class (a set of characters). The class will match any one character from the set.
Within brackets…
  specific characters can be listed
  ranges are denoted using -
Examples
  [aDb] matches a or D or b and nothing else
  [c-e] matches c or d or e and nothing else
  [a-z] matches any lowercase letter and nothing else
  [a-zA-Z0-9] matches any alphabetic or numeric symbol
  [a+*] matches a or + or * and nothing else
Character classes are a way to abbreviate certain pattern alternatives. Each character class matches a string of length 1. [aDb] is the same as a|D|b, but most special characters do not need a \ inside the square brackets.
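The class examples can be checked one character at a time. This is a minimal Python sketch; the helper name one_char_matches and the candidate strings are illustrative assumptions.

```python
import re

def one_char_matches(char_class, candidates):
    """Return the candidates that the class matches as a whole string."""
    return [c for c in candidates if re.fullmatch(char_class, c)]

listed = one_char_matches(r"[aDb]", ["a", "D", "b", "d", "ab"])
ranged = one_char_matches(r"[c-e]", ["b", "c", "d", "e", "f"])
specials = one_char_matches(r"[a+*]", ["a", "+", "*", "aa"])

# [aDb] and a|D|b accept exactly the same single characters.
same = all(
    bool(re.fullmatch(r"[aDb]", c)) == bool(re.fullmatch(r"a|D|b", c))
    for c in "abcdDAB"
)

print(listed, ranged, specials, same)
```

Note that "ab" and "aa" are rejected: a class always stands for exactly one character.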

13 Examples
Which of the following match [a-z][0-9]*
  abc   1z93   a-9
Which of the following match [0-9]*[02468]
  03   9929   354
Give a pattern for social security numbers using character classes.
  Social security number: [0-9]{3}-[0-9]{2}-[0-9]{4}
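The answer to "which of these match" depends on whether the pattern must cover the whole string or only some part of it. The Python sketch below reports both readings for the candidates above; the helper name report is an assumption made for illustration.

```python
import re

def report(pattern, candidates):
    """For each candidate, record whether the whole string matches
    and which substring (if any) the pattern finds first."""
    rows = []
    for text in candidates:
        whole = re.fullmatch(pattern, text) is not None
        found = re.search(pattern, text)
        rows.append((text, whole, found.group() if found else None))
    return rows

letters_then_digits = report(r"[a-z][0-9]*", ["abc", "1z93", "a-9"])
even_numbers = report(r"[0-9]*[02468]", ["03", "9929", "354"])

for row in letters_then_digits + even_numbers:
    print(row)
```

Under the whole-string reading only 354 matches; under the partial reading every candidate contains a matching piece, such as z93 inside 1z93.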

14 Example 1: Phone Numbers
Create a regular expression to match phone numbers. The phone numbers can take on the following forms: x1234 [1-.]+80{2}[-. ]]5{3}[- ]1212[-x]1234

15 Example 1: Phone Numbers
Divide and conquer
Note that each phone number has at most four parts.
  prefix (the number 1)
  area code
  trunk (first three digits)
  rest (next 4 digits)
  extension (last digits. May be between 1 and 4 in length)
Consider defining each of these parts
  what is the prefix?
  what is the area code?
  what is the trunk?
  what is the rest?
  what is the extension?

16 Example 1: Phone Numbers
We need to 'conquer' by combining the solutions for the parts.
Rules:
  The prefix is optional
  One of the following must occur between the prefix and the area code: space, comma, dash, period
  One of the following must occur between the area code and the trunk: space, comma, dash, period
  One of the following must occur between the trunk and the rest: space, comma, dash, period
  An ‘x’ must occur between the rest and the extension.
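The rules above could be combined roughly as follows. This Python sketch rests on assumptions the slides do not state: the area code is taken to be exactly three digits, the area code is treated as required, and the extension (with its x) is treated as optional. The sample numbers are invented.

```python
import re

# One pattern per part (the three-digit area code is an assumption).
PREFIX = r"1"
AREA = r"[0-9]{3}"
TRUNK = r"[0-9]{3}"
REST = r"[0-9]{4}"
EXTENSION = r"[0-9]{1,4}"

# Allowed separators: space, comma, period, dash (dash last so it is literal).
SEP = r"[ ,.-]"

# Optional prefix plus separator, then the required parts,
# then an optional x followed by the extension (optionality is an assumption).
PHONE = (
    "(" + PREFIX + SEP + ")?"
    + AREA + SEP + TRUNK + SEP + REST
    + "(x" + EXTENSION + ")?"
)

def is_phone(text):
    """Return True when text is a complete phone number under the rules above."""
    return re.fullmatch(PHONE, text) is not None

for candidate in ["1-800-555-1212x1234", "800 555 1212", "8005551212", "800-555-1212x"]:
    print(candidate, is_phone(candidate))
```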

17 Example 2: User Name
Suppose the rules for some system are that a user name must begin with a capital letter, followed by lowercase letters and/or dashes and/or periods. The length of user names is restricted to 3 to 16 characters.
Examples
  Dave
  D.-riley
  Rdave
Invalid
  dave   doesn’t begin with a capital letter
  DDR3   capital letters and digits not permitted after first symbol
  R   too short
[A-Z][a-z.-]{2,15}
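The user name pattern can be checked against the slide's own valid and invalid examples. This is a Python sketch; the function name is illustrative.

```python
import re

# One capital letter, then 2 to 15 lowercase letters, periods or dashes
# (3 to 16 characters in total). The dash is last in the class, so it is literal.
USER_NAME = r"[A-Z][a-z.-]{2,15}"

def is_user_name(text):
    """Return True when the whole string is a valid user name."""
    return re.fullmatch(USER_NAME, text) is not None

valid = [name for name in ["Dave", "D.-riley", "Rdave"] if is_user_name(name)]
invalid = [name for name in ["dave", "DDR3", "R"] if not is_user_name(name)]

print(valid)
print(invalid)
```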

18 Example 3: MAC Address
Every computer network connection has a unique MAC address that is expressed as six numbers separated by colons. Each number consists of two hexadecimal digits.
Examples
  10:22:93:04:91:00
  AF:0C:AA:ED:B7:21
Invalid
  10:22:93:04:91   too short
  10:22:013:04:91   numbers must be two digits long, not three
  AG:0C:AA:ED:B7:21   the letter “G” is not a hexadecimal digit
[0-9A-F]{2}\:[0-9A-F]{2}\:[0-9A-F]{2}\:[0-9A-F]{2}\:[0-9A-F]{2}\:[0-9A-F]{2}
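Since a MAC address has six two-digit groups, the repetition can also be expressed with a quantifier on a group. The Python sketch below uses that compact form, which is an equivalent rewrite rather than the pattern shown on the slide, and checks it against the slide's examples.

```python
import re

# One hexadecimal pair, then five more pairs each preceded by a colon.
HEX_PAIR = r"[0-9A-F]{2}"
MAC = HEX_PAIR + "(:" + HEX_PAIR + "){5}"

def is_mac(text):
    """Return True when text is six colon-separated uppercase hex pairs."""
    return re.fullmatch(MAC, text) is not None

for candidate in [
    "10:22:93:04:91:00",
    "AF:0C:AA:ED:B7:21",
    "10:22:93:04:91",
    "10:22:013:04:91",
    "AG:0C:AA:ED:B7:21",
]:
    print(candidate, is_mac(candidate))
```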

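19 Example 4: IPV4
Internet addresses are referred to as IP numbers. A common address consists of four positive integers separated by periods. These integers must each be within the range of
Examples
Invalid
  no number can be greater than 255
  too few numbers
  193.24:17.2 separators must be periods
[0-255]\.[0-255]\.[0-255]\.[0-255]
Note that [0-255] is a character class, not a numeric range: it matches exactly one of the characters 0, 1, 2 or 5. A pattern that accepts one integer from 0 to 255 is (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]).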
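Because a character class describes single characters rather than numeric ranges, limiting each number to 255 takes an alternation over digit shapes. The following Python sketch is one common way to write it, not the pattern from the slide; apart from 193.24:17.2, the test addresses are invented.

```python
import re

# One number from 0 to 255, split by shape:
# 250-255 | 200-249 | 100-199 | 0-99 (no leading zero on two-digit numbers)
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four such numbers separated by literal periods.
IPV4 = OCTET + r"(\." + OCTET + r"){3}"

def is_ipv4(text):
    """Return True when text is four period-separated numbers, each 0 to 255."""
    return re.fullmatch(IPV4, text) is not None

# The class [0-255] accepts only a single character: 0, 1, 2 or 5.
class_accepts_five = re.fullmatch(r"[0-255]", "5") is not None
class_accepts_seven = re.fullmatch(r"[0-255]", "7") is not None
class_accepts_255 = re.fullmatch(r"[0-255]", "255") is not None

for candidate in ["193.24.17.2", "256.1.1.1", "193.24.17", "193.24:17.2"]:
    print(candidate, is_ipv4(candidate))
print(class_accepts_five, class_accepts_seven, class_accepts_255)
```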

20 Example 5: Email Addresses
An address consists of two strings separated by @: localString@domainString
localString
  Must be one or more of the following characters: alphabetic, digits (0 through 9), or any of these !#$&’+-_/=?^`{|}~
  Periods are permitted but with the following restrictions: the first and last characters cannot be periods and there cannot be any consecutive periods.
  Note: There is another unusual notation for selected characters only allowed inside double quotes, which we will ignore.
domainString
  Must be one or more of the following characters: alphabetic, digits, dashes or periods.
  Alternately, the domain could be written as a pair of square brackets enclosing four numbers separated by periods, where each of the four numbers is a non-negative number of one to three digits.
  e.g., [ ]
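The rules above translate into a pattern built from three pieces. This Python sketch follows only the simplified rules on the slide, not the full email standard. It assumes the ’ in the character list is a plain apostrophe, and the sample addresses are invented.

```python
import re

# One run of permitted local characters (no periods). The dash is placed last
# so that it is literal rather than forming a range such as +-_.
ATOM = r"[A-Za-z0-9!#$&'+_/=?^`{|}~-]+"

# Runs separated by single periods: no leading, trailing or doubled periods.
LOCAL = ATOM + r"(\." + ATOM + r")*"

# Either a plain domain string or four 1-3 digit numbers inside square brackets.
DOMAIN_NAME = r"[A-Za-z0-9.-]+"
DOMAIN_NUMERIC = r"\[[0-9]{1,3}(\.[0-9]{1,3}){3}\]"

EMAIL = LOCAL + "@(" + DOMAIN_NAME + "|" + DOMAIN_NUMERIC + ")"

def is_email(text):
    """Return True when text fits the simplified address rules above."""
    return re.fullmatch(EMAIL, text) is not None

for candidate in [
    "user.name@example.edu",
    "user@[192.0.2.1]",
    ".user@example.edu",
    "user..name@example.edu",
    "user.@example.edu",
]:
    print(candidate, is_email(candidate))
```

Splitting the local part into period-free runs is what enforces the period restrictions; a single class containing the period would accept addresses such as user..name.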

