Text manipulation


1 Text manipulation
Suppose you want to build a web page which will always contain the latest sports headlines collected from several newspaper websites.
You might, for example, wish to include the Guardian’s sports headlines on your page.


3 Adding these headlines manually
You would have to access the source of the Guardian page.

4 You would then have to:
– find the text which defines the headlines
– analyse it
– copy the relevant bits into the HTML for your own web page

5 Examining it, we find that the source contains one HTML table for each sport in the list of top stories.
Here is the table for the tennis headlines on the page seen earlier:
Murray magic books semi spot
Tennis: The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.
Tough home Davis Cup tie for GB
More tennis

6 Here is the text which defines the main tennis headline on the page shown earlier:
Murray magic books semi spot
Tennis: The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.

7 To get this story onto your own web page, you could then copy the relevant HTML segment into the source code for your web page.
But doing this manually is very labour-intensive.
We ought to automate the complete task.

8 Adding headlines automatically
To add headlines automatically, you would have to write a program which would:
– download the source code for the Guardian page
– analyse this source code to extract the appropriate text
– add the relevant text to the source code for your own web page

9 Adding headlines automatically
Later, we will see how to download page sources from other websites.
Now, we will focus on the issue of text analysis.

10 Regular Expressions
Regular expression technology provides a convenient way of searching strings for patterns of interest.

11 Regular expressions (contd.)
Example regular expression: /ab*c/
This searches the target string for substring(s) that comprise “an a, followed by zero or more instances of b, followed by a c”.
It will match any of the following substrings: ac, abc, abbc, abbbc, …

12 Using regular expressions in PHP
Regular expressions are supported in several languages, including PHP.
PHP provides a group of pre-defined functions for using them.
For now, we will focus on just one of these, the preg_replace function.

13 The preg_replace function
Format of call: preg_replace(regexp, replacement, subject [, int limit])
This function returns the result of replacing the substrings in subject which match regexp with replacement.
The number of matching substrings which are replaced is controlled by the optional parameter limit; if limit is omitted, every match is replaced.
An example application is on the next slide.

14 Regular expressions (contd.)
The PHP code on this slide replaces every match of /ab*c/ with _
Resultant output is:
myString is xyzacklmabbcpqrabbbbbcstu
myString is now xyz_klm_pqr_stu
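The slide's own code did not survive in this transcript. A minimal reconstruction, assuming the pattern /ab*c/ from slide 11 and an underscore as the replacement, might look like this:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$myString = 'xyzacklmabbcpqrabbbbbcstu';
echo "myString is $myString\n";

// Replace every substring that is an a, zero or more b's, then a c.
$myString = preg_replace('/ab*c/', '_', $myString);
echo "myString is now $myString\n";
```

The three matches are ac, abbc and abbbbbc; everything between them is left untouched.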

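15 Using the limit parameter in preg_replace
The PHP code on this slide repeats the previous replacement with limit set so that only the first match is replaced.
Resultant output is:
myString is xyzacklmabbcpqrabbbbbcstu
myString is now xyz_klmabbcpqrabbbbbcstu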
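The output above is consistent with a limit of 1. As a sketch, assuming the same pattern and replacement as the previous example:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$myString = 'xyzacklmabbcpqrabbbbbcstu';
echo "myString is $myString\n";

// The fourth argument caps the number of replacements at one,
// so only the first match (ac) is replaced.
$myString = preg_replace('/ab*c/', '_', $myString, 1);
echo "myString is now $myString\n";
```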

16 Meta-characters
We have seen that certain characters have a special meaning in regular expressions:
– the example on the last few slides used the * character, which means “0 or more instances of the preceding character or pattern”
These are called meta-characters.
Other meta-characters are listed on the next slide.

17 The meta-characters include:
– the * character, which means “0 or more instances of preceding”
– the + character, which means “1 or more instances of preceding”
– the ? character, which means “0 or 1 instances of preceding”
– the { and } characters, which delimit an expression specifying a range of acceptable occurrences of the preceding character
Examples:
– {m} means exactly m occurrences of the preceding character/pattern
– {m,} means at least m occurrences of the preceding character/pattern
– {m,n} means at least m, but not more than n, occurrences of the preceding character/pattern
Thus:
– {0,} is equivalent to *
– {1,} is equivalent to +
– {0,1} is equivalent to ?
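The equivalences between the short quantifiers and their brace forms can be checked directly. The following sketch uses preg_match_all, which is not covered in these slides, and a test string chosen purely for illustration:

```php
<?php
// Illustrative test string (an assumption, not from the slides).
$subject = 'ac abc abbc abbbc';

// Each short quantifier is paired with its {m,n} equivalent.
$pairs = [
    '/ab*c/' => '/ab{0,}c/',
    '/ab+c/' => '/ab{1,}c/',
    '/ab?c/' => '/ab{0,1}c/',
];

$results = [];
$allAgree = true;
foreach ($pairs as $short => $long) {
    preg_match_all($short, $subject, $m1); // all matches of the short form
    preg_match_all($long, $subject, $m2);  // all matches of the brace form
    $results[$short] = $m1[0];
    $allAgree = $allAgree && ($m1[0] === $m2[0]);
    echo $short . ' matches: ' . implode(' ', $m1[0]) . "\n";
}
echo 'Brace forms agree: ' . ($allAgree ? 'yes' : 'no') . "\n";
```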

18 Regular expressions (contd.)
Further meta-characters are:
– the ^ character, which matches the start of a string
– the $ character, which matches the end of a string
– the . character, which matches any character except a newline
– the [ and ] characters, which delimit an equivalence class of characters, any of which can match one character in the target string
– the ( and ) characters, which delimit a group of sub-patterns
– the | character, which separates alternative patterns

19 Regular expressions (contd.)
Example expression: /^a.*d$/
This matches the entire target string, provided the target string starts with an a, followed by zero or more non-newline characters, and ends with a d.
An example application is on the next slide.

20 Example application
The PHP code on this slide replaces every match of /^a.*d$/ with _
Resultant output is:
myString1 is abcdefghijklmnopqrstuvd
myString1 is now _
myString2 is xabcdefghijklmnopqrstuvd
myString2 is now xabcdefghijklmnopqrstuvd
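A minimal reconstruction of this example, assuming an underscore as the replacement (the slide's own code is not in the transcript):

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$pattern = '/^a.*d$/';

$myString1 = 'abcdefghijklmnopqrstuvd';
$myString2 = 'xabcdefghijklmnopqrstuvd';

// The first string starts with a and ends with d, so the whole string is replaced.
$result1 = preg_replace($pattern, '_', $myString1);
// The second starts with x, so ^a cannot match and the string is returned unchanged.
$result2 = preg_replace($pattern, '_', $myString2);

echo "myString1 is $myString1\nmyString1 is now $result1\n";
echo "myString2 is $myString2\nmyString2 is now $result2\n";
```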

21 Regular expressions (contd.)
Example expression: /^a.{2,5}d$/
This matches the entire target string, provided the target string starts with an a, followed by between two and five non-newline characters, and ends with a d.
An example application is on the next slide.

22 Regular expressions (contd.)
The PHP code on this slide replaces every match of /^a.{2,5}d$/ with _
Resultant output is:
myString1 is adabbbbccccaaaabbbbccccd
myString1 is now adabbbbccccaaaabbbbccccd
myString2 is afghd
myString2 is now _
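As a sketch, assuming an underscore as the replacement, the example can be reproduced like this:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$pattern = '/^a.{2,5}d$/';

$myString1 = 'adabbbbccccaaaabbbbccccd';
$myString2 = 'afghd';

// Far more than five characters sit between the first a and the last d: no match.
$result1 = preg_replace($pattern, '_', $myString1);
// Exactly three characters (fgh) sit between a and d: the whole string matches.
$result2 = preg_replace($pattern, '_', $myString2);

echo "myString1 is $myString1\nmyString1 is now $result1\n";
echo "myString2 is $myString2\nmyString2 is now $result2\n";
```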

23 Regular expressions (contd.)
Example regular expression: /(abc){2,5}d/
This matches substring(s) in the target that comprise “between 2 and 5 repeats of the pattern abc, followed by a d”.
An example application is on the next slide.

24 Regular expressions (contd.)
The PHP code on this slide replaces every match of /(abc){2,5}d/ with _
Resultant output is:
myString is klmabcabcabcdpqrabcdklmabcabcabcabcdxyz
myString is now klm_pqrabcdklm_xyz
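A possible reconstruction, again assuming an underscore as the replacement:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$myString = 'klmabcabcabcdpqrabcdklmabcabcabcabcdxyz';
echo "myString is $myString\n";

// abcabcabcd (3 repeats) and abcabcabcabcd (4 repeats) match;
// the lone abcd has only one repeat of abc, so it is left alone.
$myString = preg_replace('/(abc){2,5}d/', '_', $myString);
echo "myString is now $myString\n";
```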

25 Regular expressions (contd.)
Example regular expression: /(foo|bar)/
This matches the substrings foo or bar.
An example application is on the next slide.

26 Regular expressions (contd.)
The PHP code on this slide replaces every match of /(foo|bar)/ with _
Resultant output is:
myString is abcfoodefbarghi
myString is now abc_def_ghi
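A short reconstruction of the alternation example, assuming an underscore as the replacement:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$myString = 'abcfoodefbarghi';
echo "myString is $myString\n";

// Either alternative, foo or bar, is replaced wherever it occurs.
$myString = preg_replace('/(foo|bar)/', '_', $myString);
echo "myString is now $myString\n";
```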

27 Regular expressions (contd.)
Although some characters have special meanings in regular expressions, we may sometimes just want to use them to match themselves in the target string.
We do this by escaping them in the regular expression, that is, by preceding them with a backslash \
Example regular expression: /^a\^+.*d$/
This matches the entire target string, provided the target string starts with an a, followed by one or more caret characters, followed by zero or more non-newline characters, and ends with a d.
An example application is on the next slide.

28 Example application
The PHP code on this slide replaces every match of /^a\^+.*d$/ with _
Resultant output is:
myString1 is adabbbbcabbcabced
myString1 is now adabbbbcabbcabced
myString2 is a^^^abbbbcabbcabceed
myString2 is now _
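A sketch of this example, assuming an underscore as the replacement. The pattern is written in a single-quoted PHP string so that the backslash reaches the regular expression engine unchanged:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$pattern = '/^a\^+.*d$/';

$myString1 = 'adabbbbcabbcabced';
$myString2 = 'a^^^abbbbcabbcabceed';

// No caret follows the leading a, so the pattern cannot match.
$result1 = preg_replace($pattern, '_', $myString1);
// Three literal carets follow the leading a, so the whole string matches.
$result2 = preg_replace($pattern, '_', $myString2);

echo "myString1 is $myString1\nmyString1 is now $result1\n";
echo "myString2 is $myString2\nmyString2 is now $result2\n";
```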

29 Regular expressions (contd.)
As mentioned earlier, the [ and ] characters have a special meaning in regular expressions:
– they delimit an equivalence class of characters, any one of which may be used to match one character in the target string
Example regular expression: /a[KLM]b/
This matches any substring comprising “the letter a, followed by one of the three letters K, L or M, followed by the letter b”.

30 Regular expressions (contd.)
The ^ character has a special meaning when used as the first character between the [ and ] characters; this meaning is different from its special meaning when used outside them:
– when used as the first character between the [ and ] characters, the ^ character specifies the complement of the equivalence class that would have been specified if it were absent
Example regular expression: /a[^KLM]b/
This matches any substring comprising “the letter a, followed by any single character that is not K, L or M, followed by the letter b”.

31 Regular expressions (contd.)
The - character also has a special meaning when used between the [ and ] characters:
– it is used to join the start and end of a sequence of characters, any one of which may be used to match one character in the target string
Example regular expression: /a[0-9]b/
This matches any substring comprising “the letter a, followed by one digit, followed by the letter b”.

32 Regular expressions (contd.)
Example regular expression: /%[a-fA-F0-9]/
This matches any substring comprising “a % followed by a hexadecimal digit”.
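The four character-class patterns from the last few slides can be compared side by side. This sketch uses preg_match_all, which is not covered in these slides, and a test string invented for illustration:

```php
<?php
// Illustrative test string (an assumption, not from the slides).
$subject = 'aKb aXb a5b a-b %7E %zz';

$patterns = [
    '/a[KLM]b/',      // a, then one of K, L, M, then b
    '/a[^KLM]b/',     // a, then any character other than K, L, M, then b
    '/a[0-9]b/',      // a, then one digit, then b
    '/%[a-fA-F0-9]/', // %, then one hexadecimal digit
];

$found = [];
foreach ($patterns as $pattern) {
    preg_match_all($pattern, $subject, $m);
    $found[$pattern] = $m[0]; // list of matched substrings
    echo $pattern . ' matches: ' . implode(' ', $m[0]) . "\n";
}
```

Note that /a[^KLM]b/ also matches a5b and a-b: the complemented class accepts any character outside K, L and M, not only letters.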

33 Regular expressions (contd.)
Certain escape sequences also have a special meaning in regular expressions. They define certain commonly used equivalence classes of characters:
– \w is equivalent to [a-zA-Z0-9_]
– \W is equivalent to [^a-zA-Z0-9_]
– \d is equivalent to [0-9]
– \D is equivalent to [^0-9]
– \s is equivalent to [ \n\t\f\r]
– \S is equivalent to [^ \n\t\f\r]
– \b denotes a word boundary
– \B denotes a non-word boundary
Note the SP (space) character in the definitions of \s and \S: the white-space equivalence class includes the space character.
By the way, \f is form feed and \r is carriage return.

34 Regular expressions (contd.)
Example regular expression: /%\d\d\d\D/
This matches any substring comprising “a % followed by three decimal digits, followed by a non-digit”.
Example regular expression: /\s\w\w\s/
This matches any substring comprising “a white-space character, followed by two word characters, followed by another white-space character”.

35 Regular expressions (contd.)
The PHP code on this slide replaces every match of /\s\w\w\s/ with _
Resultant output is:
myString is This is not an apple
myString is now This_not_apple
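A minimal reconstruction of this example, assuming an underscore as the replacement:

```php
<?php
// Reconstruction under stated assumptions; not the slide's original code.
$myString = 'This is not an apple';
echo "myString is $myString\n";

// Each two-letter word is replaced together with the white space on both sides:
// " is " and " an " each become a single underscore.
$myString = preg_replace('/\s\w\w\s/', '_', $myString);
echo "myString is now $myString\n";
```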

36 Regular expressions (contd.)
The standard quantifiers are all “greedy”: they match as many occurrences as possible without causing the pattern to fail.
It is possible to make them “frugal”, that is, make them match the minimum number of times necessary.
We do this by following the quantifier with a ?
– *? Match 0 or more times, preferably only 0
– +? Match 1 or more times, preferably only 1 time
– ?? Match 0 or 1 time, preferably only 0
– {n}? Match exactly n times
– {n,}? Match at least n times, preferably only n times
– {n,m}? Match at least n but not more than m times, preferably only n times

37 Regular expressions (contd.)
PHP code
Resultant output is:
myString1 is abcabcabcabc
myString1 is now x
myString2 is abcabcabcabc
myString2 is now xx
What is going on here? See next slide for contrast.
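The transcript does not record which patterns produced these outputs. One pair that reproduces them, offered purely as an assumption, is a greedy {2,5} quantifier set against its frugal counterpart:

```php
<?php
// Hypothetical patterns chosen to reproduce the outputs above;
// the slide's actual patterns are unknown.
$myString1 = 'abcabcabcabc';
$myString2 = 'abcabcabcabc';

// Greedy: {2,5} takes all four repeats of abc in a single match.
$result1 = preg_replace('/(abc){2,5}/', 'x', $myString1);
// Frugal: {2,5}? stops after two repeats, so the string yields two matches.
$result2 = preg_replace('/(abc){2,5}?/', 'x', $myString2);

echo "myString1 is $myString1\nmyString1 is now $result1\n";
echo "myString2 is $myString2\nmyString2 is now $result2\n";
```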

38 Regular expressions (contd.)
PHP code
Resultant output is:
myString1 is abcabcabcabc
myString1 is now x
myString2 is abcabcabcabc
myString2 is now xabcabc
Discussion of contrast with previous slide...
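Again the slide's patterns are not in the transcript. One way to obtain these outputs, stated here as an assumption, is to anchor a greedy and a frugal {2,5} quantifier to the start of the string with ^:

```php
<?php
// Hypothetical patterns chosen to reproduce the outputs above;
// the slide's actual patterns are unknown.
$myString1 = 'abcabcabcabc';
$myString2 = 'abcabcabcabc';

// Greedy and anchored: one match covering the whole string.
$result1 = preg_replace('/^(abc){2,5}/', 'x', $myString1);
// Frugal and anchored: only the first two repeats match, and the
// remaining abcabc cannot match because ^ only holds at the start.
$result2 = preg_replace('/^(abc){2,5}?/', 'x', $myString2);

echo "myString1 is $myString1\nmyString1 is now $result1\n";
echo "myString2 is $myString2\nmyString2 is now $result2\n";
```

Passing a limit of 1 to preg_replace with an unanchored frugal pattern would give the same second result.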

39 A digression
Before proceeding to further regexp concepts, let’s apply what we have already seen to HTML manipulation.

40 Example task
Suppose we have an HTML list with the following items:
– wine
– f12
– cheese
Suppose we want to eliminate from the list any list item whose content comprises only non-digits.
That is, we want the HTML list to become:
– f12

41 Regular expressions (contd.)
PHP code
Resultant output is:
myString is
– wine
– f12
– cheese
myString is now
– f12
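The HTML markup and the PHP code for this task were lost from the transcript. The sketch below assumes the list is an unordered list (ul) and uses the class [^<\d] rather than \D, so that a match can never run across a tag into the next item; both choices are assumptions:

```php
<?php
// Assumed markup: the original tags are not in the transcript.
$myString = '<ul><li>wine</li><li>f12</li><li>cheese</li></ul>';
echo "myString is $myString\n";

// Remove each list item whose content contains no digits.
// [^<\d]* matches a run of characters that are neither digits nor the start of a tag.
$myString = preg_replace('/<li>[^<\d]*<\/li>/', '', $myString);
echo "myString is now $myString\n";
```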

42 Seeing the raw HTML
Suppose we want to see the raw HTML in our output, that is, the tags themselves rather than the rendered list.
We would have to replace all occurrences of < with &lt;
We could use regular expressions for this, but:
– the string to be replaced is a constant
– so we can use a simpler technology

43 Regular expressions (contd.)
PHP code
Now the resultant output, displayed with its tags visible, is:
myString is wine f12 cheese
myString is now f12
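The slides do not name the simpler technology in the surviving text. PHP's str_replace, which replaces a fixed string without any pattern matching, is one candidate; the markup below is assumed:

```php
<?php
// Assumed markup: the original tags are not in the transcript.
$myString = '<ul><li>f12</li></ul>';

// Replace the constant string "<" with the entity "&lt;" so that a browser
// displays the tags as text instead of rendering them.
$visible = str_replace('<', '&lt;', $myString);
echo "myString is now $visible\n";
```

PHP's htmlspecialchars function does the same job more thoroughly, since it also converts & and > to entities.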

44 Suppose we want to replace the content of every list item with the fixed phrase listItem.
That is, we want to see this output:
myString is wine f12 cheese
myString is now listItem listItem listItem

45 Regular expressions (contd.)
Suppose we try this.
Resultant output is:
myString is wine f12 cheese
myString is now listItem
What is wrong? We need to make the + quantifier ungreedy.

46 Regular expressions (contd.)
We must do this.
Resultant output is:
myString is wine f12 cheese
myString is now listItem listItem listItem
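The two attempts can be sketched as follows. The markup, the patterns and the replacement text are all assumptions consistent with the outputs shown, since the original code is not in the transcript:

```php
<?php
// Assumed markup: the original tags are not in the transcript.
$myString = '<ul><li>wine</li><li>f12</li><li>cheese</li></ul>';

// Greedy .+ runs from the first <li> to the last </li>,
// so the three items collapse into one.
$greedy = preg_replace('/<li>.+<\/li>/', '<li>listItem</li>', $myString);

// Frugal .+? stops at the nearest </li>, so each item is replaced separately.
$frugal = preg_replace('/<li>.+?<\/li>/', '<li>listItem</li>', $myString);

echo "greedy result: $greedy\n";
echo "frugal result: $frugal\n";
```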

47 End of digression
Back to regular expressions...

48 Regular expressions (contd.): remembering subpattern matches
When a regular expression is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern.
Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses.
The sub-patterns are implicitly numbered, starting from 1, and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3
However, the pattern is written inside a PHP string, where the backslash is itself an escape character (in a double-quoted string, \1 would be read as an octal character code), so to get the backslash we need to escape it: we must type \\1 or \\2 or \\3 in our regular expressions.

49 Using back-references (contd.)
PHP code
Resultant output is:
myString1 is klmAklmAAklmABklmBklmBBklm
myString1 is now klmAklm_klmABklmBklm_klm
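The pattern used on this slide is not in the transcript. One pattern that reproduces the output, given here as an assumption, captures a single capital letter and then requires the same letter again via a back-reference:

```php
<?php
// Hypothetical pattern chosen to reproduce the output above.
$myString1 = 'klmAklmAAklmABklmBklmBBklm';
echo "myString1 is $myString1\n";

// ([A-Z]) remembers one capital letter; \\1 in the PHP string becomes \1 in the
// regular expression and matches only a repeat of that same letter.
// AA and BB are doubled letters; A, AB and B are not.
$result = preg_replace('/([A-Z])\\1/', '_', $myString1);
echo "myString1 is now $result\n";
```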


