Text Extraction using Regular Expressions Shih-Pei Chen Project Manager, China Biographical Database, Harvard University The Digitization in the Humanities.

1 Text Extraction using Regular Expressions Shih-Pei Chen Project Manager, China Biographical Database, Harvard University The Digitization in the Humanities Workshop @ Rice University April 5-7, 2013

2 Downloads for today Slides and sample texts (a package) – On OWL-Space Text editor(s) – Mac users: please install TextWrangler – PC users: EmEditor, UltraEdit, or both CBDB Regex Machine – e515758 -- download the on this page

3 The China Biographical Database – Modeling Life Histories – from anecdote to data Biography Prosopography Social Network Analysis Geospatial Analysis

4 Big Data What are you going to do with the vast amount of text on the Web? – Is there information you want to search for? – Is there something you want to analyze? CBDB experience: we use regular expressions to extract biographical data from thousands of historical records (in their full texts)

5 What regular expressions can do for you Beyond keyword search – Search for written variations – Search for patterns – Search and replace => tagging You don’t have to learn programming in order to use regular expressions – Just use a text editor which supports regex

6 Today Part 1: Learn regular expressions – Hands-on exercises matching regexes against some texts in a text editor. Part 2: A real play – Using regexes + Search and Replace in a text editor Part 3: CBDB Regex Machine – Using a graphical user interface to design regexes and test them against a text. Tagging the matches in XML tags.


8 Regular expressions Regular expressions are a powerful way of describing patterns in strings You describe the pattern, the machine matches it against the text (a string of letters, digits, and symbols)

9 Automata Imagine a belt sending characters in line: The string in line (the input): abcde – It can match this pattern: abcde – It can also match this pattern: bc (but only the substring “bc” in the input will be matched – partial match)

10 Comparing the input against the regex character by character Input: abcde Regex: abcde Match!

11 Comparing the input against the regex character by character Input: abcde Regex: bc Match! Behind the scenes: The robot picks up a in the input, and finds that a does not match b, the first character in the regex. Then, the robot throws a out, and picks up the next character in the input, which is b. This time the robot finds that the two b’s match each other.

12 Comparing the input against the regex character by character Input: abcde Regex: bd No match! ×
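The belt walk-through on the three slides above can be reproduced outside a text editor. The sketch below assumes Python's built-in `re` module as the regex engine (the workshop itself uses a text editor); the variable names are illustrative only.

```python
import re

text = "abcde"  # the input on the belt
patterns = ["abcde", "bc", "bd"]

# re.search slides along the input and stops at the first position where
# the whole pattern fits; it returns None when no position works.
results = {p: re.search(p, text) for p in patterns}

for p, m in results.items():
    if m:
        print(f"regex {p!r}: matched {m.group(0)!r} starting at index {m.start()}")
    else:
        print(f"regex {p!r}: no match")
```

The regex bd fails because b and d are never adjacent in the input, even though both characters occur in it.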

13 Switch to a good text editor Text editors which support regex – Windows: EmEditor or UltraEdit (both not free) – Mac: TextWrangler (free)

14 Regular expressions – the syntax What can you describe using regular expressions?

15 Characters Literal characters – abcde, bc, bd (string match) Non-Printable Characters – \t (tab), \r (carriage return), \n (line feed) – Line breaks: \r (Mac), \n (Unix), or \r\n (Windows) Special characters (reserved characters / metacharacters) – [ ] \ ^ $ . | ? * + ( ) Examples come from:

16 Exercise #1 Download and install one of the above text editors. Download the “regex text.txt” file. Open it in your text editor. Call up the “Search” or “Find” function in your editor, and try the regexes in Exercise#1 to see which regexes can be matched.

17 Character Classes – what can appear at a certain position? gr[ae]y can match gray or grey – Characters in [ ] form a class (bag of characters) – gr[ae]y will not match graay nor graey! Common character classes – [a-z], [A-Z], [a-zA-Z], [0-9] Exercise#2 Input: graey Regex: gr[ae]y
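A character class stands for exactly one character at that position. A minimal sketch, assuming Python's `re` module and a made-up word list, shows why graay and graey are rejected:

```python
import re

words = ["gray", "grey", "graay", "graey"]
pattern = re.compile(r"gr[ae]y")

# fullmatch requires the entire word to fit the pattern.
# [ae] consumes one character only, so a second vowel leaves no room for y.
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```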

18 Shorthand Character Classes \d (digit) : shorthand of [0-9] \D (non-digit character) \w (word character): [A-Za-z0-9_] \W (non-word character) \s (whitespace character): [ \t\r\n] (white space, tab, carriage return, line feed) \S (non-whitespace character)

19 Negated Character Classes Any character except these – [^aeiou] : not one of a, e, i, o, u – [^\d] : not digit – [^\s] : not white space
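The shorthand and negated classes can be tried on a short string. This is a sketch assuming Python's `re` module; the sample strings are invented. One caveat: in Python 3, \d and \w also accept non-ASCII digits and letters, so they equal [0-9] and [A-Za-z0-9_] only for plain ASCII text.

```python
import re

sample = "Room 42, floor 7"

digits = re.findall(r"\d", sample)    # each single digit
non_word = re.findall(r"\W", sample)  # spaces and punctuation

# Negated class: any character that is neither a listed vowel nor whitespace.
not_vowel = re.findall(r"[^aeiou\s]", "regex is fun")

print(digits, non_word, not_vowel)
```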

20 Dot . can match (almost) any single character – Except the newline character => . is shorthand for [^\n] (Unix), [^\r] (Mac), [^\r\n] (Windows) Exercise#3

21 Optional and Repeat operators 3 operators for expressing repetition – ? : zero or one time (optional) – + : repeat one or more times – * : repeat zero or more times Repeat a certain number of times: – \d{1,4} : one to four digits – \d{1,} : one digit or more (equivalent to \d+ ) Exercise#4
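The repetition operators are easiest to compare side by side. The sketch below assumes Python's `re` module; the test strings are made up for illustration.

```python
import re

# ? makes the preceding item optional (zero or one time).
optional = re.findall(r"colou?r", "color colour colouur")

# * allows zero or more repetitions of the preceding item.
star = re.findall(r"ab*", "a ab abbb")

# + needs at least one digit; {1,4} stops after four.
numbers = "year 2013 and 12345"
plus = re.findall(r"\d+", numbers)
bounded = re.findall(r"\d{1,4}", numbers)

print(optional, star, plus, bounded)
```

Note how \d{1,4} splits the five-digit number into a four-digit match followed by a one-digit match, because each match takes as many digits as the upper bound allows.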

22 Alternation (list of words) Useful when you have a list of words, and you want to find the occurrence of each – cat|dog|mouse|fish : find any one of the four – regex|regular expression : find either regex or regular expression Exercise#5 Examples come from:

23 Capturing writing variations Suppose you want to find all the occurrences mentioning regular expressions, but it can be written as “regular expression(s)” or “regex(es)”. Use this pattern to find them all: reg(ular expressions?|ex(es)?) Examples come from:
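To see the variation pattern at work, here is a small sketch assuming Python's `re` module and an invented sentence. `finditer` is used instead of `findall` because `findall` would return the contents of the parenthesized groups rather than the whole match.

```python
import re

pattern = re.compile(r"reg(ular expressions?|ex(es)?)")
sentence = "A regex is a regular expression; regexes and regular expressions abound."

# group(0) is the entire matched string for each occurrence.
found = [m.group(0) for m in pattern.finditer(sentence)]
print(found)
```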

24 What can regular expressions do for you Provide better full-text search – Find a word without worrying about its variations – Find specific info written in regular forms: dates, phone numbers, email addresses, HTML/XML tags, quotes, all-capital abbreviations… – Find two words near each other Perform formatting tasks on a text Automate tagging

25 Find information written in regular forms Exercise #6: finding dates as of mm/dd/yy – \d\d.\d\d.\d\d – \d\d[- /.]\d\d[- /.]\d\d – [0-1]\d[- /.][0-3]\d[- /.]\d\d – (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d Exercise #7: finding texts within double quotes – ".*" – "[^"\r\n]*" – "[^"]*" Examples come from:
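The progression in Exercises #6 and #7 goes from loose to strict patterns. A sketch assuming Python's `re` module, with invented sample text, compares the loosest and strictest version of each:

```python
import re

text = "Due 04/07/13, not 13/45/99 or 12-31-99."

# Loosest pattern: any character may separate any two-digit numbers.
loose = re.findall(r"\d\d.\d\d.\d\d", text)

# Strictest pattern from the slide: month 01-12, day up to 31, fixed separators.
strict_date = re.compile(r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d")
strict = [m.group(0) for m in strict_date.finditer(text)]

quoted = 'He said "yes" and "no".'
greedy = re.findall(r'".*"', quoted)       # .* runs on to the last quote
bounded = re.findall(r'"[^"]*"', quoted)   # stops at the next quote

print(loose, strict, greedy, bounded)
```

The strict date pattern rejects 13/45/99, and the negated class keeps each quotation separate instead of swallowing everything between the first and last quote mark.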

26 Grouping and back references Exercise #8: finding HTML/XML tags – ]*>.*? – 04/07/13 ( ) : capturing group \1 : back reference to the 1st captured group – If there is more than one pair of ( ), use \2, \3, etc. – The whole matched string is referenced as \0 Examples come from:
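A back reference requires the same text that an earlier group captured to appear again. The sketch below assumes Python's `re` module. The tag pattern is a simplified, hypothetical stand-in written for this illustration, not the pattern from the exercise, and the sample strings are invented. Also note that Python spells the whole match as \g<0> in replacements rather than \0.

```python
import re

html = "<b>bold</b> and <i>mismatched</b>"

# (\w+) captures the tag name; \1 demands the same name in the closing tag.
tag_pattern = re.compile(r"<(\w+)>.*?</\1>")
matches = [m.group(0) for m in tag_pattern.finditer(html)]

# In a replacement, \1, \2, \3 insert the text captured by each group.
reordered = re.sub(r"(\d\d)/(\d\d)/(\d\d)", r"\3-\1-\2", "04/07/13")

print(matches, reordered)
```

The second element is skipped because it opens with i but closes with b, so \1 cannot be satisfied.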

27 Formatting task Trimming unnecessary white spaces – Replace [ \t]{2,} with a single space – Delete leading whitespace within a line: replace ^[ \t]+ with nothing (empty string) – Trim trailing whitespace of a line: replace [ \t]+$ with nothing (empty string) Transform a text to a list of words – Append a line break after each word – Replace uppercase letters -> lowercase – Replace punctuation symbols with nothing – Count the frequency of each word in MS Excel Examples come from:
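The formatting steps above can be chained in a script. This is a sketch assuming Python's `re` module and an invented sample text; `collections.Counter` stands in for the Excel frequency count, and the `re.MULTILINE` flag is needed so that ^ and $ apply to every line rather than to the whole text.

```python
import re
from collections import Counter

raw = "  The   cat sat.\nThe dog\t\tsat too.   \n"

step1 = re.sub(r"[ \t]{2,}", " ", raw)                      # collapse runs of blanks
step2 = re.sub(r"^[ \t]+", "", step1, flags=re.MULTILINE)   # strip leading blanks
clean = re.sub(r"[ \t]+$", "", step2, flags=re.MULTILINE)   # strip trailing blanks

# Word list: lowercase, drop punctuation, then split on whitespace.
words = re.sub(r"[^\w\s]", "", clean.lower()).split()
freq = Counter(words)

print(repr(clean))
print(freq.most_common())
```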

28 Automate tagging Idea: Find dates via some regex, and then surround each of the matches with tags: some date Replace our date pattern: (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d with : \0 Try it in the date exercise Once you can tag useful info in a text, it will be easy to pull them out.
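Tagging is a search and replace in which the replacement re-inserts the match between an opening and a closing tag. The sketch below assumes Python's `re` module, where the whole match is written \g<0>. The tag name `date` and the sample sentence are assumptions made for this illustration.

```python
import re

date_pattern = re.compile(r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d")
text = "Filed 04/07/13 and 12-31-99."

# \g<0> is the entire matched date; the <date> tag name is hypothetical.
tagged = date_pattern.sub(r"<date>\g<0></date>", text)
print(tagged)
```

Once the dates are tagged, a second pattern such as <date>([^<]*)</date> can pull them back out as a list.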

29 Resources for regular expressions – Profhacker article: “Finding the Women of Heimskringla with Regular Expressions” – women-of-heimskringla-with-regular-expressions/38631 → Digital humanities article: –


31 Our texts today Get familiar with it Use regex to do some search Search in files Then use the techniques to prepare the text for Regex Machine

32 Texts for today: Old Bailey Proceedings You can find samples in today’s package under “Old Bailey Proceedings” Or, you can download them on your own: – select all and copy – paste it to a text editor – save it as UTF-8 without BOM (byte order mark)

33 Old Bailey’s Proceeding: the HTML presentation

34 Text form

35 Try some search Search for: t\d{8}-\d{3} Replace it with: \0

36 Exercise: Preparing your text in a specific format (to feed to some software)

37 How to convert? Observe! Goal: to make each case a single line Patterns? Every case begins with a line of “Reference Number” and ends before the next “Reference Number” Need to remove all the line breaks Tricky things: does the text contain XML reserved characters (&, <, >, …)?

38 Conversion Steps: Search and Replace + regexes Replace the XML reserved characters: – & => &amp; % => % – < => &lt; > => &gt; Get rid of “285.”: ^\d{3}\. => nothing (empty string) Replace all the line breaks (\r, \n, \r\n) with nothing Reassign the line breaks by “Reference number:” – Reference number: => \rReference number: Optional: Get rid of “See original” The order above is crucial
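The conversion steps can also be scripted. The following is a minimal sketch, assuming Python's `re` module; the function name `prepare`, the sample record text, the spelling “Reference Number:” with a capital N, and the use of \n instead of \r for the new line breaks are all assumptions made for illustration. The ampersand is escaped first because the later entities contain an ampersand themselves.

```python
import re

def prepare(text: str) -> str:
    """Put each case on a single line, following the slide's order of steps."""
    # 1. Escape XML reserved characters; & must come first.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    # 2. Drop the three-digit case number (e.g. "285.") at the start of a line.
    text = re.sub(r"^\d{3}\.", "", text, flags=re.MULTILINE)
    # 3. Remove every line break, whatever the platform convention.
    text = re.sub(r"\r\n|\r|\n", "", text)
    # 4. Start a new line before each reference number.
    text = text.replace("Reference Number:", "\nReference Number:")
    return text.lstrip("\n")

raw = (
    "Reference Number: t18500107-285\r\n"
    "285. JOHN SMITH, stealing 1 coat & 1 hat.\r\n"
    "GUILTY. Aged 23.\r\n"
    "Reference Number: t18500107-286\r\n"
    "286. ANN BROWN, stealing 1 watch.\r\n"
    "NOT GUILTY.\r\n"
)

lines = prepare(raw).split("\n")
for line in lines:
    print(line)
```

Running step 2 after step 3 would fail, because once the line breaks are gone the case numbers no longer sit at the start of a line.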

39 What does the Regex Machine do? A graphical user interface (GUI) that enables people who do not have programming skills to – graphically design patterns – match them against a corpus of texts – see results immediately via a user-friendly color-coding scheme (quick feedback) – export to XML => automates (part of) the tagging procedure Credit: Elif Yamagil

40 Downloading CBDB RegexMachine Regex Machine (on CBDB website) – geid=icb.page515758 -- download the on this page Prerequisites: – Make sure your machine has Java Runtime Environment (JRE) installed. If not, you can download it here:

41 Run the Regex Machine Double click the CBDBRegexMachine.jar In the “Select Your User Director” window, select the folder where you put your text files. – Tip: don’t double click the folder! Single click is all you need.

42 GUI List of active regex List of “terms” Your Text Info Box

43 Open the text we just prepared File → Open. Select your text file.

44 Create Active Regexes First regex: capture the reference number – Example: t18500107-285 – Pattern: t\d{8}-\d{3} – It’s always good to test it first in a text editor Create it in Regex Machine – Think first: is it one unit? Does it contain different parts?

45 1. Click 2. Click 3. Fill in your regex and give it a name

46 4. Give the whole regex a name. Then choose a color! 5. Click on the Regex. Matches are highlighted!

47 Export to XML 6. File → Export 7. Set records per file to 1000 8. Then an XML file should be generated in the same folder as the text file!

48 XML header added. Each line is surrounded by the tag with line number. The number is now tagged with the Handle you specified!

49 Try another regex Second regex: capture the “Reference Number:” and the number – Example: Reference Number: t18500107-285 – Pattern: Reference Number: t\d{8}-\d{3} Create it in Regex Machine – Think first: Do you want it to be tagged as a whole? Should the match contain different parts?

50 Using multiple groups in an Active Regex Add another Active Regex. Create two groups: Group #1: Reference Number: Group #2: t\d{8}-\d{3}

51 Group #1 Group #2 Capture this group!

52 Click on the new one to highlight the matched strings. Then click Move Up. Export to XML.

53 The whole string is tagged, and the number part is “captured” as an attribute!
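The two-group idea can be imitated in a script to check a pattern before building it in the Regex Machine. This sketch assumes Python's `re` module. The tag name `reference`, the attribute name `refNo`, the helper `to_tag`, and the sample line are hypothetical and are not the Regex Machine's actual export format.

```python
import re

line = "Reference Number: t18500107-285 JOHN SMITH was indicted."

# Group 1 is the fixed label, group 2 the number we want to capture.
ref = re.compile(r"(Reference Number: )(t\d{8}-\d{3})")

def to_tag(m: re.Match) -> str:
    # Whole match becomes the tag content; group 2 becomes an attribute value.
    return f'<reference refNo="{m.group(2)}">{m.group(0)}</reference>'

tagged = ref.sub(to_tag, line)
number = ref.search(line).group(2)
print(tagged)
```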

54 What else to capture? Name of defendant(s) Verdict: guilty or not guilty, age, punishment Any patterns observed?

55 Pattern for verdicts – If NOT GUILTY, normally nothing more. – If GUILTY, normally has Aged \d{2} followed by the punishment. – There can be more than one verdict in each record (if more than one defendant)

56 NOT GUILTY 1: Give the whole regex a name. It will become the XML tag name surrounding the entire matched string 2: Give the pattern as the exact text “NOT GUILTY” Handle: give it a name Capturing group: The name here will be used as the attribute name of the XML tag. The captured value will become the value of the attribute.

57 GUILTY GUILTY.*Aged ?\d{1,3}[^—]*—.* – Group #1: guilty or not => GUILTY – Group #2: age => \d{1,3} – Group #3: punishment => .* – Something in between the desired groups Between group 1 & 2: .*Aged ? Between group 2 & 3: [^—]*— Need to create 5 groups!
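Splitting the verdict pattern into five consecutive groups looks like this in code. The sketch assumes Python's `re` module; the sample verdict sentence is invented, and the groups are numbered 1 to 5 by position, so the three wanted values sit in groups 1, 3, and 5.

```python
import re

# Five groups in a row: verdict, filler, age, filler, punishment.
verdict = re.compile(r"(GUILTY)(.*Aged ?)(\d{1,3})([^—]*—)(.*)")

m = verdict.search("GUILTY. Aged 23.—Confined Six Months.")
guilty, age, punishment = m.group(1), m.group(3), m.group(5)
print(guilty, age, punishment)

# A NOT GUILTY verdict has no age, so the full pattern does not match it.
not_guilty = verdict.search("NOT GUILTY.")
print(not_guilty)
```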




61 Export to XML You can then use a browser to open it (more readable). You can further use an XML editor to correct mistakes (validation).

62 Open the XML in Excel *Please note that not every XML file can be interpreted well in Excel. This is because the two handle different data structures: Excel is for tabular data, and XML is for trees, which are much more flexible. *Also, the Mac version of MS Excel doesn’t read XML!

63 One last thing How about the names of the defendants? What is the pattern? – The names are right after the reference number. – They are all in capital letters. – There can be more than one name. In that case, a mixture of spaces, commas, and “and” is used to connect the names.

64 Test this pattern in a text editor: – Reference Number: ?[a-z]\d{8}-\d{3}\s+([A-Z' ]+),?(?: ?([A-Z' ]+),)*(?: ?and([A-Z' ]+))? – What does it capture? Break into groups: – refNo: [a-z]\d{8}-\d{3} – First defendant: ([A-Z' ]+) – Second (or more) defendant: (?: ?([A-Z' ]+),)* – Last defendant: (?: ?and([A-Z' ]+))?
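To answer "What does it capture?", the pattern can be run against a few sample lines. This sketch assumes Python's `re` module; the defendant names, the sample sentences, and the helper `defendants` are invented. In Python, a group inside a repeated ( ... )* keeps only its last repetition, so with four or more defendants some middle names are lost. Other engines may behave differently.

```python
import re

pattern = re.compile(
    r"Reference Number: ?[a-z]\d{8}-\d{3}\s+"
    r"([A-Z' ]+),?"            # first defendant
    r"(?: ?([A-Z' ]+),)*"      # middle defendants (only the last one is kept)
    r"(?: ?and([A-Z' ]+))?"    # last defendant
)

def defendants(line: str) -> list[str]:
    """Return the captured names, trimmed, skipping groups that did not match."""
    m = pattern.search(line)
    return [g.strip() for g in m.groups() if g] if m else []

one = defendants("Reference Number: t18500107-285 JOHN SMITH was indicted")
three = defendants(
    "Reference Number: t18500107-286 JOHN SMITH, MARY JONES, and ANN BROWN were indicted"
)
four = defendants(
    "Reference Number: t18500107-300 TOM ONE, SAM TWO, BOB THREE, and JOE FOUR were indicted"
)
print(one, three, four)
```

Because the space is part of the class [A-Z' ], captured names can carry leading or trailing blanks, which is why the helper trims them.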




68 Good! Some problem


70 A real extraction project on local gazetteers – by Adam Mitchell Raw descriptions written in the gazetteers (extracted) Source Date Disaster type Location Disaster types: Earthquakes and fires; Epidemics and Insect Plagues; Snow, Ice, and Tempests; Floods and Droughts; Famines, Hyperinflation, and Relief Efforts.

71 Collect data at the local levels and then aggregate

72 Reflections on using the Regex Machine Carefully design your regexes and groups Think ahead about what you want in the XML Tuning regexes can take dozens of hours It’s difficult to find regexes that capture everything -- there are always omissions, exceptions, etc. Keep in mind the cost of tuning “perfect” regexes.

73 Put regular expressions in a bigger context Using regex to search / capture data of interest – only when the piece of information is written in regular patterns What if there are no regular patterns? How can we teach machines to identify important information in a corpus of texts? – If it’s location names, person names => Named Entity Recognition (NER) – If it’s concepts => topic modeling, … – Text mining, machine learning, … and more

74 Conclusion Hope to let you understand what regex is Hope to give you some hands-on experience in using regexes against some texts Hope to give you a sense of what machines can do with texts => Your imagination: you can begin to think about what texts are available and what you can do with them.

