Science: Text and Language Dr Andy Evans. Text analysis Processing of text. Natural language processing and statistics.


1 Science: Text and Language Dr Andy Evans

2 Text analysis Processing of text. Natural language processing and statistics.

3 Processing text: Regex Java Regular Expressions: java.util.regex. Regular expressions: powerful search, compare (and replace) tools. (Other types of regex include direct replace options; in Java regex these are separate methods.)

4 Regex Standard Java: if ((email.indexOf("@") > 0) && (email.endsWith(".org"))) { return true; } Regex version: if (email.matches("[A-Za-z]+@[A-Za-z]+\\.org")) return true;
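The two checks on this slide are not quite equivalent, which a small runnable sketch makes visible. The class and method names (EmailCheck, looseCheck, regexCheck) and the sample addresses are made up for illustration; only String.indexOf, String.endsWith and String.matches come from the slide.

```java
class EmailCheck {
    // Plain String methods: '@' must appear after the first character
    // and the address must end in ".org".
    static boolean looseCheck(String email) {
        return email.indexOf("@") > 0 && email.endsWith(".org");
    }

    // Regex version: String.matches must match the WHOLE string, so this
    // accepts only letters, one '@', letters, then ".org".
    static boolean regexCheck(String email) {
        return email.matches("[A-Za-z]+@[A-Za-z]+\\.org");
    }

    public static void main(String[] args) {
        String[] samples = {"andy@example.org", "@example.org", "a@b@c.org", "andy@example.com"};
        for (String s : samples) {
            System.out.println(s + " loose=" + looseCheck(s) + " regex=" + regexCheck(s));
        }
    }
}
```

The address "a@b@c.org" passes the indexOf/endsWith test but fails the regex, because the regex describes the whole string rather than two separate properties of it.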

5 Example components
[abc]  a, b, or c (simple class)
[^abc]  Any character except a, b, or c (negation)
[a-zA-Z]  a through z, or A through Z, inclusive (range)
[a-d[m-p]]  a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]  d, e, or f (intersection)
[a-z&&[^bc]]  a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
.  Any character (may or may not match line terminators)
\d  A digit: [0-9]
\D  A non-digit: [^0-9]
\s  A whitespace character: [ \t\n\x0B\f\r]
\S  A non-whitespace character: [^\s]
\w  A word character: [a-zA-Z_0-9]
\W  A non-word character: [^\w]
?  Once or not at all
*  Zero or more times
+  One or more times
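A few of these components can be tried directly with Pattern.matches, which tests a whole string against a pattern. This is a minimal sketch; the class name CharClassDemo, the helper matchesWhole and the sample strings are illustrative assumptions, not part of the slides.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[a-d[m-p]]", "n"));   // union: n is in m-p
        System.out.println(matchesWhole("[a-z&&[^bc]]", "b")); // subtraction: b is excluded
        System.out.println(matchesWhole("\\d+", "2024"));      // one or more digits
        System.out.println(matchesWhole("colou?r", "color"));  // 'u' once or not at all
        System.out.println(matchesWhole("\\w*", ""));          // zero or more word characters
    }
}
```

Note that each backslash in the table has to be doubled inside a Java string literal, so \d is written "\\d".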

6 Matching Find all words that start with a number. Pattern p = Pattern.compile("\\b\\d\\w*"); Matcher m = p.matcher(stringToSearch); while (m.find()) { String temp = m.group(); System.out.println(temp); }
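The find/group loop can be wrapped in a complete program along these lines. One pattern that fits the stated goal (words that start with a digit) is \b\d\w*: a word boundary, one digit, then any further word characters. That pattern, the class name DigitWords and the sample sentence are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitWords {
    // \b = word boundary, \d = a digit, \w* = the rest of the word.
    private static final Pattern STARTS_WITH_DIGIT = Pattern.compile("\\b\\d\\w*");

    // Collects every match instead of printing it, so the result can be reused.
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = STARTS_WITH_DIGIT.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("Room 42b holds 3 desks and 7up cans, not a1 parts"));
    }
}
```

The word "a1" is skipped because there is no word boundary between "a" and "1", so the digit is not at the start of a word.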

7 Replacing replaceFirst(String regex, String replacement) replaceAll(String regex, String replacement)
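Both signatures above are methods of java.lang.String. A short sketch of how they behave follows; the class name ReplaceDemo, the helper names and the sample strings are made up for illustration. The last helper also uses $1 and $2 in the replacement, which refer back to captured groups.

```java
class ReplaceDemo {
    // Replaces only the first run of digits.
    static String maskFirstNumber(String s) {
        return s.replaceFirst("\\d+", "#");
    }

    // Replaces every run of digits.
    static String maskAllNumbers(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // Swaps "year-month" into "month/year" using captured groups.
    static String swapYearMonth(String s) {
        return s.replaceAll("(\\d+)-(\\d+)", "$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(maskFirstNumber("a1b22c333")); // a#b22c333
        System.out.println(maskAllNumbers("a1b22c333"));  // a#b#c#
        System.out.println(swapYearMonth("2024-05"));     // 05/2024
    }
}
```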

8 Regex Good start is the tutorial at: http://docs.oracle.com/javase/tutorial/essential/regex/ Also Mehran Habibi’s Java Regular Expressions.

9 Natural Language Processing A large part is Part of Speech (POS) Tagging: marking up text as nouns, verbs, etc., usually based on location in the text and other context rules. These rules are often derived using machine learning (of various kinds), by training the program on corpora of marked-up text. Used for: Text understanding. Knowledge capture and use. Text forensics.

10 NLP Libraries Popular are: Natural Language Toolkit (NLTK; Python) http://www.nltk.org/ OpenNLP (Java) http://opennlp.apache.org/index.html

11 OpenNLP Sentence recognition and tokenising. Name extraction (including placenames). POS Tagging. Text classification. For clear examples, see the manual at: http://opennlp.apache.org/documentation.html
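To give a feel for what sentence recognition and tokenising produce, here is a rough stand-in that uses only the Java standard library (java.text.BreakIterator plus a \w+ regex). It is not OpenNLP and does not use its API or trained models, which handle abbreviations and punctuation far better; the class name SimpleSplitter and the sample text are assumptions for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleSplitter {
    // Splits text into sentences using the JDK's rule-based BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Very crude tokeniser: every run of word characters is one token.
    static List<String> tokens(String sentence) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(sentence);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Text analysis is fun. Regex helps a lot.")) {
            System.out.println(s + " -> " + tokens(s));
        }
    }
}
```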

12 Other info Other than the Numerical Recipes books, the other classic texts are Donald E. Knuth’s The Art of Computer Programming: Fundamental Algorithms; Seminumerical Algorithms; Sorting and Searching; Combinatorial Algorithms. But at this stage, you’re better off getting…

13 Other info Michael T. Goodrich and Roberto Tamassia’s Data Structures and Algorithms in Java. Basic Java, arrays and lists. Recursion in algorithms. Key mathematical algorithms. Algorithm analysis. Data storage structures (stacks, queues, hashtables, binary trees, etc.). Search and sort. Text processing. Graph/network analysis. Memory management.

