Finding the needle(s) in the textual haystack

1 Finding the needle(s) in the textual haystack http://www.flickr.com/photos/seadigs/4810947743/sizes/z/in/photostream/

2 Consider the text below. How would you identify … Proper names? … Email addresses? … Dates?
From: Gow, Joe
Subject: Reminder About Open Forums Today
Date: March 25, 2011 8:44:08 AM CDT
Bcc: personnel@uwlax.edu
Hello, everyone. I just wanted to send a quick reminder about the two campus wide Open Forums we're holding today from 2 to 3 and 3 to 4 p.m. in the Cleary Center. I'll host the first session from 2 to 3, and we'll cover any topics you'd like to discuss. Then from 3 to 4 Vice Chancellor Bob Hetzel will lead a conversation about the plans for a new Cowley Science Building. Please join us!
Thanks, Joe
Joe Gow, Chancellor
University of Wisconsin-La Crosse

3 What do you think of when you see the following? MM/DD/YYYY This is a (string) pattern. Are there different patterns for this same thing? How would you describe the pattern of a credit card number?

4 Regular expressions are “formulas” for string patterns. Regular expressions follow a standard notation. Regular expressions can be used in various computer applications and programming languages. Applying a regular expression to a string (piece of text) is called pattern matching. - The regular expression might match the string (or part of it) or it might not.

5 Regular expressions use a standard pattern language.
Any (non-meta) character is a pattern. The character pattern represents itself.
The '.' (period) is a pattern. The period (a meta character) represents "any character".
If A and B are both patterns, then so are
  AB : This represents the pattern A followed by pattern B
    F. matches Fa, FR and F3 but not fa or aF
  A|B : This represents either the pattern A or the pattern B
    P|Q matches P and Q but not R
Parentheses are special; they form a pattern group. Anything in parentheses is a group. A group is one "thing".
(red|blue) fish matches what strings?
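
The patterns on this slide can be tried out directly. The following is a minimal sketch using Python's re module; the helper name matches and the test strings for the fish pattern are assumptions made for illustration, and "matches" is taken to mean that the whole string fits the pattern.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# "F." is the character F followed by any one character.
dot_results = {s: matches(r"F.", s) for s in ["Fa", "FR", "F3", "fa", "aF"]}

# "P|Q" is either the pattern P or the pattern Q.
alt_results = {s: matches(r"P|Q", s) for s in ["P", "Q", "R"]}

# The group (red|blue) acts as one "thing", followed by " fish".
fish_results = {
    s: matches(r"(red|blue) fish", s)
    for s in ["red fish", "blue fish", "redblue fish", "green fish"]
}

print(dot_results)
print(alt_results)
print(fish_results)
```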

6 How would you write an expression for the time on a digital 12-hour clock? [HINT: Let's divide & conquer]
A regular expression matching any possible hour:
  1|2|3|4|5|6|7|8|9|10|11|12
A regular expression matching any possible minute:
  (0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
A regular expression matching any possible time:
  (1|2|3|4|5|6|7|8|9|10|11|12):(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)
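
The divide-and-conquer idea can be sketched in Python by building the hour and minute patterns separately and joining them with a colon. The names HOUR, MINUTE, TIME and is_time are illustrative assumptions, not part of the slides.

```python
import re

# Each part is written exactly as on the slide, then combined.
HOUR = r"(1|2|3|4|5|6|7|8|9|10|11|12)"
MINUTE = r"(0|1|2|3|4|5)(0|1|2|3|4|5|6|7|8|9)"
TIME = HOUR + ":" + MINUTE

def is_time(text: str) -> bool:
    """True if the whole string is a 12-hour clock time such as 7:05."""
    return re.fullmatch(TIME, text) is not None

for candidate in ["7:05", "10:30", "12:59", "13:00", "0:15", "7:60"]:
    print(candidate, is_time(candidate))
```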

7 Quantifiers are used to allow and constrain repetitions.
If re is a regular expression (pattern), then so are:
  re* represents zero or more repetitions of re
  re+ represents one or more repetitions of re
  re? represents zero or one occurrence of re
  re{n} represents exactly n repetitions of re (n is some positive integer)
  re{m,n} represents at least m and no more than n repetitions of re (m, n are non-negative integers, m ≤ n)
Write a regular expression for Social Security Numbers: 123-45-6789
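
A short Python sketch of each quantifier follows, ending with one possible answer to the Social Security Number exercise that uses only the notation introduced so far. The names matches, DIGIT and SSN are assumptions for illustration, and other answers are equally valid.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

star = [matches(r"ab*", s) for s in ["a", "ab", "abbb"]]             # zero or more b
plus = [matches(r"ab+", s) for s in ["a", "ab", "abbb"]]             # one or more b
optional = [matches(r"ab?", s) for s in ["a", "ab", "abb"]]          # zero or one b
exact = matches(r"(ab){2}", "abab")                                  # exactly two "ab"
ranged = [matches(r"a{2,3}", s) for s in ["a", "aa", "aaa", "aaaa"]]  # two or three a

# One possible SSN pattern: 3 digits, dash, 2 digits, dash, 4 digits.
DIGIT = r"(0|1|2|3|4|5|6|7|8|9)"
SSN = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"

print(star, plus, optional, exact, ranged)
print(matches(SSN, "123-45-6789"))
```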

8 Text: I sometimes wonder if the manufacturers of foolproof items keep a fool or two on their payroll. Pattern: o{2}l?

9 Some characters have special meaning in regular expressions, and others have no printable form. Such characters can still be represented using a 2-character notation, known as an escape code.
  \+ represents +
  \. represents .
  The same technique works for * ? ( ) { } [ ] \ ^ $ |
  \n represents the newline character
  \t represents the tab character
  \r represents the carriage return character
  \v represents the vertical tab character
  \f represents the form feed character
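
The difference between an escaped and an unescaped meta character can be seen in a small Python sketch. The helper name matches and the sample strings are assumptions for illustration; re.escape is a standard-library function that adds the backslashes automatically.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

plus_literal = matches(r"1\+1", "1+1")     # \+ is a literal plus sign
plus_quantifier = matches(r"1+1", "1+1")   # + means "one or more 1s" here
dot_literal = matches(r"B\.C\.", "BxCy")   # \. only matches a real period
dot_meta = matches(r"B.C.", "BxCy")        # . matches any character
tab_found = re.search(r"\t", "a\tb") is not None  # \t finds the tab

# re.escape builds a pattern that matches its argument literally.
escaped_ok = matches(re.escape("3.5*2"), "3.5*2")

print(plus_literal, plus_quantifier, dot_literal, dot_meta, tab_found, escaped_ok)
```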

10 There are also two “location” symbols. ^ matches the start of a line, including right after \n. $ matches the end of a line, including right before \n.
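
In Python's re module, the per-line behavior described here is switched on with the re.MULTILINE flag; without it, ^ only matches at the very start of the string. The sample text below is an assumption for illustration.

```python
import re

text = "one\ntwo\nthree"

# Without the flag, ^ matches only at the very start of the string,
# and the string starts with "o", so nothing is found.
single_line = re.findall(r"^t.*$", text)

# With re.MULTILINE, ^ also matches right after each \n
# and $ also matches right before each \n.
multi_line = re.findall(r"^t.*$", text, flags=re.MULTILINE)

print(single_line)
print(multi_line)
```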

11 (snow|rain)(flake|drop)
g(rr|ee)*
W.* W
B\.C\.
^Right now.$
^Right now.\$

12 Square brackets enclose a character class (a set of characters). The class will match any one character from the set.
Within brackets…
  – specific characters can be listed
  – ranges are denoted using -
Examples
  [aDb] matches a or D or b and nothing else
  [c-e] matches c or d or e and nothing else
  [a-z] matches any lowercase letter and nothing else
  [a-zA-Z0-9] matches any alphabetic or numeric symbol
  [a+*] matches a or + or * and nothing else
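
The character class examples can be checked with a short Python sketch. The helper name matches and the particular test characters are assumptions chosen for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

listed = [matches(r"[aDb]", c) for c in ["a", "D", "b", "d"]]
ranged = [matches(r"[c-e]", c) for c in ["c", "d", "e", "f"]]
alnum = [matches(r"[a-zA-Z0-9]", c) for c in ["Q", "q", "7", "_"]]

# Inside brackets, + and * lose their special meaning.
symbols = [matches(r"[a+*]", c) for c in ["a", "+", "*", "b"]]

# A class matches exactly one character, so two characters fail.
one_only = matches(r"[a-z]", "ab")

print(listed, ranged, alnum, symbols, one_only)
```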

13 Which of the following match [a-z][0-9]*
  abc
  1z93
  a-9
Which of the following match [0-9]*[02468]
  03
  9929
  354
Give a pattern for social security numbers using character classes.
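
The answer to "which of these match" depends on whether the pattern must cover the whole string or only part of it. The Python sketch below shows both readings, using re.fullmatch for the whole string and re.search for any part. The names whole, part and SSN are assumptions, and the SSN pattern is only one possible answer.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """True if the entire string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

def part(pattern: str, text: str):
    """Return the first matching piece of the string, or None."""
    found = re.search(pattern, text)
    return found.group() if found else None

letter_digits = r"[a-z][0-9]*"
even_number = r"[0-9]*[02468]"

whole_letter = [whole(letter_digits, s) for s in ["abc", "1z93", "a-9"]]
part_letter = [part(letter_digits, s) for s in ["abc", "1z93", "a-9"]]
whole_even = [whole(even_number, s) for s in ["03", "9929", "354"]]
part_even = [part(even_number, s) for s in ["03", "9929", "354"]]

# One possible SSN pattern using character classes.
SSN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

print(whole_letter, part_letter)
print(whole_even, part_even)
print(whole(SSN, "123-45-6789"))
```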

14 Create a regular expression to match phone numbers. The phone numbers can take on the following forms:
  800-555-1212
  800 555 1212
  800.555.1212
  1-800-555-1212
  800-555-1212-1234
  800-555-1212x1234

15 Divide and conquer
Note that each phone number has at most five parts.
  prefix (the number 1)
  area code
  trunk (first three digits)
  rest (next 4 digits)
  extension (last digits; may be between 1 and 4 in length)
Consider defining each of these parts
  – what is the prefix?
  – what is the area code?
  – what is the trunk?
  – what is the rest?
  – what is the extension?

16 We need to 'conquer' by combining the solutions for the parts.
Rules:
  – The prefix is optional
  – One of the following must occur between the prefix and the area code: space, comma, dash, period
  – One of the following must occur between the area code and the trunk: space, comma, dash, period
  – One of the following must occur between the trunk and the rest: space, comma, dash, period
  – An ‘x’ must occur between the rest and the extension.
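
One way to combine the parts under the rules above is sketched below in Python. The names SEP, PREFIX, AREA, TRUNK, REST, EXTENSION, PHONE and is_phone are assumptions for illustration, and the extension is treated as optional. Because the rules require an x before the extension, this sketch rejects the 800-555-1212-1234 form listed earlier.

```python
import re

SEP = r"[ ,.-]"                   # space, comma, period or dash
PREFIX = r"(1" + SEP + r")?"      # optional 1 plus its separator
AREA = r"[0-9]{3}"
TRUNK = r"[0-9]{3}"
REST = r"[0-9]{4}"
EXTENSION = r"(x[0-9]{1,4})?"     # optional x plus 1 to 4 digits

PHONE = PREFIX + AREA + SEP + TRUNK + SEP + REST + EXTENSION

def is_phone(text: str) -> bool:
    """True if the whole string fits the combined phone pattern."""
    return re.fullmatch(PHONE, text) is not None

for number in ["800-555-1212", "800 555 1212", "800.555.1212",
               "1-800-555-1212", "800-555-1212x1234", "800-555-1212-1234"]:
    print(number, is_phone(number))
```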

17 Suppose the rules for some system are that a user name must begin with a capital letter, followed by lowercase letters and/or dashes and/or periods. The length of a user name is restricted to 3 to 16 characters.
Examples
  Dave
  D.-riley
  Rdave
Invalid
  dave    doesn’t begin with a capital letter
  DDR3    capital letters and digits not permitted after first symbol
  R       too short
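
A possible pattern for these user names is one capital letter followed by 2 to 15 further characters, which gives a total length of 3 to 16. The Python sketch below checks it against the slide's examples; the names USERNAME and is_username are assumptions for illustration.

```python
import re

# One capital letter, then 2 to 15 lowercase letters, periods or dashes.
USERNAME = r"[A-Z][a-z.-]{2,15}"

def is_username(text: str) -> bool:
    """True if the whole string fits the user name rules."""
    return re.fullmatch(USERNAME, text) is not None

for name in ["Dave", "D.-riley", "Rdave", "dave", "DDR3", "R"]:
    print(name, is_username(name))
```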

18 Every computer network connection has a unique MAC address that is expressed as six numbers separated by colons. Each number consists of two hexadecimal digits.
Examples
  10:22:93:04:91:00
  AF:0C:AA:ED:B7:21
Invalid
  10:22:93:04:91       too short
  10:22:013:04:91      numbers must be two digits long, not three
  AG:0C:AA:ED:B7:21    the letter “G” is not a hexadecimal digit
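
A MAC address can be described as one two-digit hexadecimal number followed by five more, each preceded by a colon. The Python sketch below is one possible pattern; the names HEX_PAIR, MAC and is_mac are assumptions, as is the choice to accept lowercase a-f in addition to the uppercase letters shown in the examples.

```python
import re

HEX_PAIR = r"[0-9A-Fa-f]{2}"             # exactly two hexadecimal digits
MAC = HEX_PAIR + r"(:" + HEX_PAIR + r"){5}"  # first pair, then five ":pair"

def is_mac(text: str) -> bool:
    """True if the whole string is six colon-separated hex pairs."""
    return re.fullmatch(MAC, text) is not None

for address in ["10:22:93:04:91:00", "AF:0C:AA:ED:B7:21",
                "10:22:93:04:91", "10:22:013:04:91", "AG:0C:AA:ED:B7:21"]:
    print(address, is_mac(address))
```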

19 Internet addresses are referred to as IP numbers. A common address consists of four non-negative integers separated by periods. These integers must each be within the range of 0-255.
Examples
  1.01.001.0
  255.255.255.255
  193.24.17.2
Invalid
  256.255.255.255    no number can be greater than 255
  193.24.175.        too few numbers
  193.24:17.2        separators must be periods
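
Restricting a number to 0-255 takes a little care, because a regular expression sees digits rather than values. The Python sketch below splits the range into 250-255, 200-249 and everything up to 199, and allows leading zeros as in the example 1.01.001.0. The names OCTET, IP and is_ip are assumptions for illustration.

```python
import re

# 250-255, or 200-249, or one to three digits whose value is at most 199
# (a three-digit number in this last branch must start with 0 or 1).
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"
IP = OCTET + r"(\." + OCTET + r"){3}"

def is_ip(text: str) -> bool:
    """True if the whole string is four period-separated numbers in 0-255."""
    return re.fullmatch(IP, text) is not None

for address in ["1.01.001.0", "255.255.255.255", "193.24.17.2",
                "256.255.255.255", "193.24.175.", "193.24:17.2"]:
    print(address, is_ip(address))
```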

20 An email address consists of two strings separated by a @
  localString @ domainString
localString
  – Must be one or more of the following characters: alphabetic, digits (0 through 9), or any of these !#$&'+-_/=?^`{|}~
  – Periods are permitted but with the following restrictions: the first and last characters cannot be periods and there cannot be any consecutive periods.
  – Note: There is another unusual notation for selected characters only allowed inside double quotes, which we will ignore.
domainString
  – Must be one or more of the following characters: alphabetic, digits, dashes or periods.
  – Alternately, the domain could be written as a pair of square brackets enclosing four numbers separated by periods, where each of the four numbers is a non-negative number of one to three digits. e.g., [138.93.200.0]
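
The email rules can be assembled piece by piece in the same divide-and-conquer style. The Python sketch below is one possible reading of this slide only, not a full validator for real-world addresses. The names LOCAL_CHAR, LOCAL, DOMAIN, EMAIL and is_email are assumptions, and the quote character in the list of permitted symbols is read as a plain apostrophe.

```python
import re

# One permitted non-period character of the local string.
LOCAL_CHAR = r"[A-Za-z0-9!#$&'+\-_/=?^`{|}~]"

# Runs of permitted characters separated by single periods, so the
# local string cannot start or end with a period or contain "..".
LOCAL = LOCAL_CHAR + r"+(\." + LOCAL_CHAR + r"+)*"

# Either letters, digits, dashes and periods, or four bracketed numbers
# of one to three digits each.
DOMAIN = r"([A-Za-z0-9.\-]+|\[[0-9]{1,3}(\.[0-9]{1,3}){3}\])"

EMAIL = LOCAL + "@" + DOMAIN

def is_email(text: str) -> bool:
    """True if the whole string fits the slide's email rules."""
    return re.fullmatch(EMAIL, text) is not None

for address in ["personnel@uwlax.edu", "joe.gow@uwlax.edu",
                "joe@[138.93.200.0]", ".joe@uwlax.edu",
                "joe..gow@uwlax.edu", "joe.@uwlax.edu"]:
    print(address, is_email(address))
```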

