# Text Manipulation and Data Collection

## Presentation transcript

 Text Manipulation and Data Collection

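General Programming Practice Find a string within a text: find the string ‘man’ in the text ‘A successful man’, whose 16 characters occupy positions 0 to 15 (0123456789012345). The search starts with the first character of the pattern, ‘m’.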

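General Programming Practice [Animation: ‘m’ is compared with each character of the text in turn (‘A’, ‘s’, ‘u’, and so on) until it matches; the comparison then continues with ‘ma’.]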

General Programming Practice First, observe carefully and find any repetition. Second, start with a small problem: find ‘m’ first. Third, construct a loop to find the first character ‘m’. What is the range of this loop (variable)? What is the termination condition?

General Programming Practice Fourth, think about what should happen when the condition is met. Fifth, add another loop if necessary. What is the range of this loop (variable)? What is the termination condition? How many loops do we need? Is there a nested loop?
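The five steps above can be sketched as a pair of nested loops. This is only one possible solution, not necessarily the one shown in the lecture; the function name `find_substring` is made up for this illustration.

```python
def find_substring(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    # Outer loop: every start position where the pattern could still fit.
    for start in range(len(text) - len(pattern) + 1):
        # Inner loop: compare the pattern with the text character by character.
        for offset in range(len(pattern)):
            if text[start + offset] != pattern[offset]:
                break  # mismatch: give up on this start position
        else:
            # The inner loop ended without a mismatch: the pattern was found.
            return start
    return -1


position = find_substring("A successful man", "man")
print(position)  # 13
```

The outer loop answers "what is the range of this loop?" (every possible start position), and the `break` is its inner termination condition.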

Scenario Find all email addresses on a web page, such as username@gmail.com. Find all hyperlinks on a web page, such as http://www.unc.edu/index.html. Find a special gene in a long gene sequence. Find files that end with .doc, .txt, or .xls. Replace a string in many files.

Find a substring If the search string is fixed (for example, finding ‘unc.edu’ on the UNC homepage), the in operator is sufficient. If there are several alternative search strings (for example, ‘unc.edu’, ‘ncsu.edu’, or ‘duke.edu’ on a web page), the search has to be repeated for each one. What about finding an email address? What about finding numbers? What about finding …
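A minimal sketch of both cases in Python, assuming the page has already been read into a string (the `page` text here is invented for the example):

```python
page = "Visit http://www.unc.edu/index.html or write to username@gmail.com"

# Fixed search string: the in operator is enough.
has_unc = "unc.edu" in page
print(has_unc)  # True

# Several alternative search strings: repeat the search for each one.
domains = ["unc.edu", "ncsu.edu", "duke.edu"]
found = [domain for domain in domains if domain in page]
print(found)  # ['unc.edu']
```

Neither approach can describe "any email address" or "any number", which is what motivates the pattern languages on the following slides.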

Expression of Search String Wildcard character

Expression of Search String Finding a file whose name starts with ‘data_’ and has ‘.69’ in the middle: ls data_*.69* A wildcard string specifies a rule for finding a string. How can we find *.unc.edu, *.ncsu.edu, or *.duke.edu? We need a stronger form of expression: the regular expression.
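The same shell-style wildcard can be tried from Python with the standard `fnmatch` module. The file names below are invented for the illustration.

```python
import fnmatch

names = ["data_2015.69.txt", "data_a.69", "data_1.70", "notes_1.69.txt"]

# Same rule as `ls data_*.69*`: starts with data_ and contains .69 later on.
matches = fnmatch.filter(names, "data_*.69*")
print(matches)  # ['data_2015.69.txt', 'data_a.69']
```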

Introduction A regular expression (regex for short) is a special text string for describing a search pattern. Think of regular expressions as wildcards on steroids. Wildcard notation: *.txt finds all text files in a shell. Test a regex at http://regex101.com. Finding an email address in a text: \b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b Don’t worry if this makes little sense to you.
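As a preview, the email pattern from this slide can be run with Python's `re` module. The sample text is made up, and the pattern is a teaching example rather than a complete validator for real-world addresses.

```python
import re

# The email pattern from the slide, written as a raw string.
EMAIL = r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"

text = "Contact alice@example.com or bob.smith@unc.edu for details"
emails = re.findall(EMAIL, text)
print(emails)  # ['alice@example.com', 'bob.smith@unc.edu']
```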

Literal Characters Any character except a small list of reserved characters is a literal: used in a regex, it simply matches itself. http://www.slideshare.net/mattcasto/introduction-to-regular-expressions-1879191

Literal Characters Literals will match characters in the middle of words.

Literal Characters Literals are case sensitive – capitalization matters!
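A short sketch of these three points about literals, using the running example text:

```python
import re

text = "A successful man"

# A literal pattern matches itself.
man_start = re.search("man", text).start()
print(man_start)  # 13

# Literals also match in the middle of a word ("success" inside "successful").
success_span = re.search("success", text).span()
print(success_span)  # (2, 9)

# Literals are case sensitive: "Man" does not match "man".
capital_match = re.search("Man", text)
print(capital_match)  # None
```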

Special Characters [ \ / ^ \$ . | ? * + ( ) ]

Special Characters You can match special characters by escaping them with a backslash

Special Characters Some characters, such as { and } are only reserved depending on context.
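A minimal sketch of escaping in Python (the sample text is invented). The last two lines illustrate the context-dependence of braces as Python's `re` module handles it; other regex engines may behave differently.

```python
import re

text = "Total: $5.00 (approx.)"

# Escape $ and . with a backslash to match them literally.
price = re.search(r"\$5\.00", text).group()
print(price)  # $5.00

# Parentheses must be escaped as well.
note = re.search(r"\(approx\.\)", text).group()
print(note)  # (approx.)

# re.escape() adds the backslashes for you.
escaped = re.escape("5.00")
print(escaped)  # 5\.00

# Braces are special only when they form a repetition count.
counted = re.search(r"a{2}", "aa").group()     # {2} is a quantifier
literal = re.search(r"x{y}", "x{y}").group()   # {y} is plain text
print(counted, literal)  # aa x{y}
```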

Non-Printable Characters Some literal characters can be escaped to represent non-printable characters.

Period The period character matches any single character.
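The following sketch shows an escaped non-printable character (`\t`) and the period together. One detail worth knowing: in Python's `re` module the period does not match a newline unless the re.S modifier is given.

```python
import re

# \t in the pattern stands for the (non-printable) tab character.
tabs = re.findall(r"\t", "col1\tcol2\tcol3")
print(len(tabs))  # 2

# The period matches any single character, including a tab...
hits = re.findall(r"m.n", "man men m\tn mn")
print(hits)  # ['man', 'men', 'm\tn']

# ...but, by default in Python, not a newline.
across_lines = re.search(r"a.b", "a\nb")
print(across_lines)  # None
```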

Character Classes Used to match only one of the characters inside square brackets.

Character Classes Hyphen is a reserved character inside a character class and indicates a range.

Character Classes Caret inside a character class negates the match.

Character Classes Most special characters are treated as ordinary characters inside a character class. Only ] \ ^ and - are reserved.
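The four character-class rules can be checked with a few one-line searches (sample strings invented for the illustration):

```python
import re

# One of the listed characters.
either = re.findall(r"gr[ae]y", "gray grey groy")
print(either)  # ['gray', 'grey']

# A hyphen indicates a range.
hex_digits = re.findall(r"[0-9a-f]", "x3fz")
print(hex_digits)  # ['3', 'f']

# A leading caret negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)  # ['a', 'b']

# . + * lose their special meaning inside a class.
symbols = re.findall(r"[.+*]", "1+2*3.0")
print(symbols)  # ['+', '*', '.']
```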

Shorthand Character Classes \d – digit or [0-9] \w – word or [A-Za-z0-9_] \s – whitespace or [ \t\r\n] (space, tab, CR, LF)

Shorthand Character Classes \D – non-digit or [^\d] \W – non-word or [^\w] \S – non-whitespace or [^\s]
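A brief sketch of the shorthand classes in use (sample strings invented):

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '7']

words = re.findall(r"\w+", text)
print(words)  # ['Room', '42', 'Floor', '7']

# \s+ as a separator: split on any run of spaces, tabs, or newlines.
fields = re.split(r"\s+", "a  b\tc\nd")
print(fields)  # ['a', 'b', 'c', 'd']

# \D is the negation of \d.
letters = re.findall(r"\D+", "12ab34")
print(letters)  # ['ab']
```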

Repetition The asterisk (*) repeats the preceding item (a character, character class, or group) 0 or more times.

Repetition The plus (+) repeats the preceding item 1 or more times.

Repetition The question mark (?) repeats the preceding item 0 or 1 times, in effect making it optional.
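The three repetition operators can be compared side by side on the same input (sample strings invented):

```python
import re

words = "ct cat caat"

star = re.findall(r"ca*t", words)      # 0 or more 'a'
print(star)      # ['ct', 'cat', 'caat']

plus = re.findall(r"ca+t", words)      # 1 or more 'a'
print(plus)      # ['cat', 'caat']

question = re.findall(r"ca?t", words)  # 0 or 1 'a'
print(question)  # ['ct', 'cat']

# The question mark makes the 'u' optional.
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']
```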

Anchors The caret anchor matches the position before the first character in a string.

Anchors The dollar sign anchor matches the position after the last character in a string.

Anchors The caret and dollar sign anchors match the start and end of the line if the engine has multi-line turned on (m option).

Anchors The \A and \Z anchors are like ^ and \$ but only match the start and end of the whole string, even if the multi-line option is turned on.
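A small sketch of the anchors in Python, where the multi-line option is the re.M flag (the two-line sample text is invented):

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ and $ only match at the start and end of the string.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)
print(start_default, end_default)  # ['first'] ['line']

# With re.M, ^ and $ also match at the start and end of every line.
start_multi = re.findall(r"^\w+", text, re.M)
end_multi = re.findall(r"\w+$", text, re.M)
print(start_multi, end_multi)  # ['first', 'second'] ['line', 'line']

# \A and \Z ignore re.M and stick to the whole string.
start_string = re.findall(r"\A\w+", text, re.M)
end_string = re.findall(r"\w+\Z", text, re.M)
print(start_string, end_string)  # ['first'] ['line']
```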

Word Boundaries The \b anchor matches at a word boundary: the position before the first character in a string (like ^) if that character is a word character; the position after the last character in a string (like \$) if that character is a word character; and the position between two characters where one is a word character and the other is not.

Word Boundaries The \B anchor is the negated word boundary: any position that is not a word boundary, for example between two word characters.
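Returning to the ‘man’ example, word boundaries separate the whole word from longer words that merely contain it (the sample sentence is invented):

```python
import re

text = "A successful man manages many humans"

# \b on both sides: only the standalone word.
whole_word = re.findall(r"\bman\b", text)
print(whole_word)  # ['man']

# \b at the front only: every word that starts with 'man'.
prefixed = re.findall(r"\bman\w*", text)
print(prefixed)  # ['man', 'manages', 'many']

# \B on both sides: 'man' strictly inside a word (here, inside 'humans').
inside = re.findall(r"\Bman\B", text)
print(inside)  # ['man']
```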

Alternation The pipe symbol (|) separates two or more alternatives, any one of which can match.

Alternation Alternatives can include character classes.

Alternation Use parentheses to group alternatives when you want to limit the reach of the alternation.
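This answers the earlier question of how to find unc.edu, ncsu.edu, or duke.edu in one search. The sketch below uses an invented sample string, and the grouped version uses a non-capturing group `(?:...)` so that `findall` returns the whole match.

```python
import re

page = "www.unc.edu www.ncsu.edu www.duke.edu www.mit.edu"

# Three alternatives separated by pipes.
schools = re.findall(r"unc\.edu|ncsu\.edu|duke\.edu", page)
print(schools)  # ['unc.edu', 'ncsu.edu', 'duke.edu']

# Parentheses limit the alternation to the part that varies.
hosts = re.findall(r"\w+\.(?:unc|ncsu|duke)\.edu", page)
print(hosts)  # ['www.unc.edu', 'www.ncsu.edu', 'www.duke.edu']
```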

Eagerness Eagerness causes the order of the alternatives to matter: the engine takes the first alternative that matches.

Greediness Greediness means that the engine will always try to match as much as possible.

Laziness Laziness, or reluctance, modifies a repetition operator so that it matches only as much as it needs to.
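Eagerness, greediness, and laziness are easiest to see side by side. The strings below are invented for the illustration; in Python a repetition operator is made lazy by appending a question mark.

```python
import re

# Eagerness: the first alternative that matches wins.
eager_short = re.search(r"Get|GetValue", "GetValue").group()
eager_long = re.search(r"GetValue|Get", "GetValue").group()
print(eager_short, eager_long)  # Get GetValue

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, up to the last '>'.
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first '>' it can.
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```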

Limiting Repetition You can limit repetition with curly braces.

Limiting Repetition The second number can be omitted to mean no upper limit. Essentially, {0,} is the same as * and {1,} is the same as +.

Limiting Repetition A single number can be used to match an exact number of times.
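A quick sketch of the three curly-brace forms on an invented list of numbers:

```python
import re

text = "1 12 123 1234 12345"

# {2,3}: between two and three digits.
two_to_three = re.findall(r"\b\d{2,3}\b", text)
print(two_to_three)  # ['12', '123']

# {3,}: three or more digits.
three_or_more = re.findall(r"\b\d{3,}\b", text)
print(three_or_more)  # ['123', '1234', '12345']

# {4}: exactly four digits.
exactly_four = re.findall(r"\b\d{4}\b", text)
print(exactly_four)  # ['1234']
```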

Group Parentheses make a group and help to retrieve substrings. Parentheses can be nested.

Back References Parentheses around part of a pattern group it and create a back reference to whatever that group matched.

Named Groups Named groups let you reference matched groups by their name rather than just index.
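The sketch below covers nested groups, a back reference, and named groups in Python. The group names `user` and `host` and the sample strings are made up for the illustration; `(?P<name>...)` is Python's syntax for a named group.

```python
import re

# Nested groups: group 1 is the whole address, groups 2 and 3 are its parts.
m = re.search(r"((\w+)@(\w+\.\w+))", "mail username@gmail.com now")
print(m.group(1))  # username@gmail.com
print(m.group(2))  # username
print(m.group(3))  # gmail.com

# Back reference: \1 must repeat whatever group 1 matched (a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "it is is a test").group()
print(doubled)  # is is

# Named groups: refer to the parts by name instead of by index.
named = re.search(r"(?P<user>\w+)@(?P<host>[\w.]+)", "username@gmail.com")
print(named.group("user"), named.group("host"))  # username gmail.com
```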

Positive Lookahead Use a positive lookahead when you want to match a pattern only if it is followed by another pattern, without including that following pattern in the match.
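Applied to the earlier file-extension scenario, a lookahead returns the file names without their extensions. The file list is invented; the second search shows the negative form `(?!...)` for comparison.

```python
import re

files = "report.doc notes.txt data.xls image.png"

# Positive lookahead: a name only if .doc, .txt, or .xls follows it.
stems = re.findall(r"\w+(?=\.doc|\.txt|\.xls)", files)
print(stems)  # ['report', 'notes', 'data']

# Negative lookahead: any file whose extension is not png.
not_png = re.findall(r"\w+\.(?!png)\w+", files)
print(not_png)  # ['report.doc', 'notes.txt', 'data.xls']
```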

RegEx Summary Literals; character classes: the [ ] operator; repetition: the * and + operators; negation: ^; shorthand expressions: \w, \s, \d, \W, \S, \D; capturing groups: the ( ) operator; back-references: \1, \2, …; positive and negative lookahead: (?=…) and (?!…)

Useful RegEx

- Matching a username: `^[a-z0-9_-]{3,16}$`
- Matching a password: `^[a-z0-9_-]{6,18}$`
- Matching a hex value: `^#?([a-f0-9]{6}|[a-f0-9]{3})$`
- Matching an email: `^([a-z0-9_.-]+)@([\da-z.-]+)\.([a-z.]{2,6})$`
- Matching a url: `^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$`
- Matching an IP address: `^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$`
- Matching a html tag: `^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
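Three of these patterns can be tried directly in Python. The test strings are invented, and the patterns should be read as teaching examples rather than strict validators.

```python
import re

username = re.compile(r"^[a-z0-9_-]{3,16}$")
hex_value = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
ip_address = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

checks = {
    "long enough username": bool(username.search("my-us3r_n4m3")),
    "too short username": bool(username.search("ab")),
    "hex colour": bool(hex_value.search("#a3c113")),
    "valid IP": bool(ip_address.search("73.60.124.136")),
    "octet above 255": bool(ip_address.search("256.60.124.136")),
}
for label, result in checks.items():
    print(label, result)
```

Only the "too short username" and "octet above 255" checks come out False.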

Using RegEx in Python The “re” module provides regular expression handling.

Using RegEx in Python There are two different functions for finding a pattern, namely search() and match().

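Search() vs Match() The match() function succeeds only if the pattern matches at the very beginning of the string, as if \A were added at the front of the pattern; it does not require the match to reach the end of the string. The search() function scans the whole string and returns the first match found anywhere. To require that the whole string matches, use fullmatch() or end the pattern with \$.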
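The difference is easy to check on the running example. In Python's `re` module, match() is anchored at the start of the string only, while fullmatch() must consume the whole string.

```python
import re

text = "A successful man"

# match() only looks at the beginning of the string.
at_start = re.match("man", text)
print(at_start)  # None

# search() scans the whole string.
anywhere = re.search("man", text).span()
print(anywhere)  # (13, 16)

# match() does not need to reach the end of the string...
prefix = re.match("A", text).group()
print(prefix)  # A

# ...but fullmatch() does.
whole = re.fullmatch("A", text)
print(whole)  # None
```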

Substitution Replacement based on the regex pattern
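In Python, substitution is done with re.sub(). The sketch below uses invented sample strings and a simplified email pattern; the second call shows that the replacement can refer back to captured groups.

```python
import re

text = "Contact: username@gmail.com, other@unc.edu"

# Replace every email address with a placeholder.
masked = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<email>", text)
print(masked)  # Contact: <email>, <email>

# Groups can be reused in the replacement: reorder year-month-day.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2015-03-09")
print(reordered)  # 09/03/2015
```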

RegEx Modifiers Case-insensitive matching: re.I. Multi-line matching: re.M. Make a period match any character, including a newline: re.S. Use the Unicode character set: re.U.
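Modifiers are passed as an extra argument and can be combined with the `|` operator. A minimal sketch with invented strings:

```python
import re

# re.I: ignore case.
any_case = re.findall(r"man", "Man MAN man", re.I)
print(any_case)  # ['Man', 'MAN', 'man']

# re.S: the period also matches a newline.
spans_newline = re.search(r"a.b", "a\nb", re.S) is not None
print(spans_newline)  # True

# Combined: case-insensitive and multi-line at once.
line_starts = re.findall(r"^exon", "Exon 1\nintron\nEXON 2", re.I | re.M)
print(line_starts)  # ['Exon', 'EXON']
```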

Problem Exon extraction
