An Introduction to TokensRegex

Slides:



Advertisements
Similar presentations
CSCI 6962: Server-side Design and Programming Input Validation and Error Handling.
Advertisements

Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
Lecture 2 Introduction to C Programming
Introduction to C Programming
Data Types in Java Data is the information that a program has to work with. Data is of different types. The type of a piece of data tells Java what can.
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
An Introduction to Using Semgrex Chloé Kiddon. What is Semgrex? A java utility (in javanlp) for identifying patterns in Stanford JavaNLP SemanticGraph.
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Regular Expressions in Java. Namespace in XML Transparency No. 2 Regular Expressions Regular expressions are an extremely useful tool for manipulating.
Regular Expressions in Java. Regular Expressions A regular expression is a kind of pattern that can be applied to text ( String s, in Java) A regular.
Lecture 3: Strings, Screen Output, and Debugging Yoni Fridman 7/2/01 7/2/01.
1 A Quick Introduction to Regular Expressions in Java.
Using regular expressions Search for a single occurrence of a specific string. Search for all occurrences of a string. Approximate string matching.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
CSci 142 Data and Expressions. 2  Topics  Strings  Primitive data types  Using variables and constants  Expressions and operator precedence  Data.
1 Character Strings and Variables Character Strings Variables, Initialization, and Assignment Reading for this class: L&L,
String Escape Sequences
TokensRegex August 15, 2013 Angel X. Chang.
Applications of Regular Expressions BY— NIKHIL KUMAR KATTE 1.
Lesson 3 – Regular Expressions Sandeepa Harshanganie Kannangara MBCS | B.Sc. (special) in MIT.
Last Updated March 2006 Slide 1 Regular Expressions.
1 Variables, Constants, and Data Types Primitive Data Types Variables, Initialization, and Assignment Constants Characters Strings Reading for this class:
Java Programming: Advanced Topics 1 Collections and Wealth of Utilities.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
Regular Expressions in.NET Ashraya R. Mathur CS NET Security.
2440: 211 Interactive Web Programming Expressions & Operators.
Review: Regular expression: –How do we define it? Given an alphabet, Base case: – is a regular expression that denote { }, the set that contains the empty.
Hello.java Program Output 1 public class Hello { 2 public static void main( String [] args ) 3 { 4 System.out.println( “Hello!" ); 5 } // end method main.
5 BASIC CONCEPTS OF ANY PROGRAMMING LANGUAGE Let’s get started …
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
Introduction to C Programming Angela Chih-Wei Tang ( 唐 之 瑋 ) Department of Communication Engineering National Central University JhongLi, Taiwan 2010 Fall.
Regular Expressions CSC207 – Software Design. Motivation Handling white space –A program ought to be able to treat any number of white space characters.
Regular Expression in Java 101 COMP204 Source: Sun tutorial, …
Regular Expressions.
Kirkwood Center for Continuing Education Introduction to PHP and MySQL By Fred McClurg, Copyright © 2015, Fred McClurg, All Rights.
Regular Expressions – An Overview Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in.
 2003 Jeremy D. Frens. All Rights Reserved. Calvin CollegeDept of Computer Science(1/8) Regular Expressions in Java Joel Adams and Jeremy Frens Calvin.
VBScript Session 13.
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Kirkwood Center for Continuing Education Introduction to PHP and MySQL By Fred McClurg, Copyright © 2010 All Rights Reserved. 1.
Module 6 – Generics Module 7 – Regular Expressions.
Regular Expressions for PHP Adding magic to your programming. Geoffrey Dunn
Introducing Python CS 4320, SPRING Lexical Structure Two aspects of Python syntax may be challenging to Java programmers Indenting ◦Indenting is.
© 2004 Pearson Addison-Wesley. All rights reserved ComS 207: Programming I Instructor: Alexander Stoytchev
Operators and Expressions. 2 String Concatenation  The plus operator (+) is also used for arithmetic addition  The function that the + operator performs.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
CSE 374 Programming Concepts & Tools Hal Perkins Fall 2015 Lecture 5 – Regular Expressions, grep, Other Utilities.
CGS – 4854 Summer 2012 Web Site Construction and Management Instructor: Francisco R. Ortega Chapter 5 Regular Expressions.
CC L A W EB DE D ATOS P RIMAVERA 2015 Lecture 7: SPARQL (1.0) Aidan Hogan
An Introduction to Regular Expressions Specifying a Pattern that a String must meet.
CSCI 1100/1202 January 14, Abstraction An abstraction hides (or ignores) the right details at the right time An object is abstract in that we don't.
Variable Variables A variable variable has as its value the name of another variable without $ prefix E.g., if we have $addr, might have a statement $tmp.
Chapter 2: Data and Expressions. Variable Declaration In Java when you declare a variable, you must also declare the type of information it will hold.
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
CS0007: Introduction to Computer Programming Primitive Data Types and Arithmetic Operations.
1 Lecture 8 Shell Programming – Control Constructs COP 3353 Introduction to UNIX.
REGULAR EXPRESSION Java provides the java.util.regex package for pattern matching with regular expressions. Java regular expressions are very similar.
Strings, Characters and Regular Expressions
CSC 594 Topics in AI – Natural Language Processing
CSC 594 Topics in AI – Natural Language Processing
JAVA RegEx Manish Shrivastava 11/11/2018.
The Linux Command Line Chapter 7
Regular Expressions in Java
Regular Expressions in Java
Regular Expression in Java 101
Regular Expressions in Java
Presentation transcript:

An Introduction to TokensRegex Angel Xuan Chang May 30, 2012

What is TokensRegex? A Java utility (in Stanford CoreNLP) for identifying patterns over a list of tokens (i.e. List<CoreMap> ) Very similar to Java regex over Strings except this is over a list of tokens Complimentary to Tregex and Semgrex Be careful of backslashes Examples assumes that you are embedding the pattern in a Java String, so a digit becomes “\\d” (normally it is just \d, but need to escape \ in Java String)

TokensRegex Usage Overview TokensRegex usage is like java.util.regex Compile pattern TokenSequencePattern pattern = TokenSequencePattern.compile(“/the/ /first/ /day/”); Get matcher TokenSequenceMatcher matcher = pattern.getMatcher(tokens); Perform match matcher.match() matcher.find() Get captured groups String matched = matcher.group(); List<CoreLabel> matchedNodes = matcher.groupNodes();

Syntax – Sequence Regex Syntax is also similar to Java regex Concatenation: X Y Or: X | Y And: X & Y Quantifiers Greedy: X+, X?, X*, X{n,m}, X{n}, X{n,} Reluctant: X+?, X??, X*?, X{n,m}?, X{n}?, X{n,}? Grouping: (X)

Syntax – Nodes (Tokens) Tokens are specified with attribute key/value pairs indicating how the token attributes should be matched Special short hand to match the token text Regular expressions: /regex/ (use \/ to escape /) To match one or two digits: /\\d\\d?/ Exact string match: “text” (use \” to escape ”) To match “-”: “-” If the text only include [A-Za-z0-9_], can leave out the quotes To match December exactly: December Sequence to match date in December December /\\d\\d?/ /,/ /\\d\\d\\d\\d/

Syntax – Token Attributes For more complex expressions, we use [ <attributes> ] to indicate a token <attributes> = <basic_attrexpr> | <compound_attrexpr> Basic attribute expression has the form { <attr1>; <attr2>…} Each <attr> consist of <name> <matchfunc> <value> No duplicate attribute names allowed Standard names for key (see AnnotationLookup) word=>CoreAnnotations.TextAnnotation.class tag=>CoreAnnotations.PartOfSpeechTagAnnotation.class lemma=>CoreAnnotations.LemmaAnnotation.class ner=>CoreAnnotations.NamedEntityTagAnnotation.class

Syntax – Token Attributes Attribute match functions Pattern Matching: <name>:/regex/ (use \/ to escape /) [ { word:/\\d\\d/ } ] String Equality: <attr>:text or <attr>:”text” (use \” to escape “) [ { tag:VBD } ] [ { word:”-” } ] Numeric comparison: <attr> [==|>|<|>=|<=] <value> [ { word>100 } ] Boolean functions: <attr>::<func> EXISTS/NOT_NIL: [ { ner::EXISTS } ] NOT_EXISTS/IS_NIL IS_NUM – Can be parsed as a Java number

Syntax – Nodes (Tokens) Compound Expressions Compose compound expressions using !, &, and | Use () to group expressions Negation: !{X} [ !{ tag:/VB.*/ } ]  any token that is not a verb Conjunction: {X} & {Y} [ {word>=1000} & {word <=2000} ]  word is a number between 1000 and 2000 Disjunction: {X} | {Y} [ {word::IS_NUM} | {tag:CD} ]  word is numeric or is tagged as CD

Syntax – Sequence Regex Special Tokens [] will match any token Putting tokens together into sequences Match expressions like “from 8:00 to 10:00” /from/ /\\d\\d?:\\d\\d/ /to/ /\\d\\d?:\\d\\d/ Match expressions like “yesterday” or “the day after tomorrow” (?: [ { tag:DT } ] /day/ /before|after/)? /yesterday|today|tomorrow/

Sequence Regex – Groupings Capturing group (default): (X) Numbered from left to right as in normal regular expressions Group 0 is the entire matched expression Can be retrieved after a match using matcher.groupNodes(groupnum) Named group: (?$name X) Associate a name to the matched group matcher.groupNodes(name) Same name can be used for different parts of an expression (consistency is not enforced). First matched group is returned. Non-capturing group: (?: X)

Sequence Regex Back references String matching across tokens Use \capturegroupid to match the TEXT of previously matched sequence String matching across tokens (?m){min,max} /pattern/ To match mid-December across 1 to 3 tokens: (?m){1,3) /mid\\s*-\\s*December/

Advanced – Environments All patterns are compiled under an environment Use environments to Set default options Bind patterns to variables for later expansion Define custom string to attribute key (Class) bindings Define custom Boolean match functions

Advanced - Environments Define an new environment Env env = TokenSequencePattern.getNewEnv(); Set up environment Compile a pattern with environment TokenSequencePattern pattern = TokenSequencePattern.compile(env, …);

Advanced - Environments Setting default options Set default pattern matching behavior To always do case insensitive matching env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE); Bind patterns to variables for later expansion Bind pattern for recognizing seasons env.bind("$SEASON", "/spring|summer|fall|winter/"); TokenSequencePattern pattern = TokenSequencePattern.compile(env, “$SEASON”); Bound variable can be used as a sequence of nodes or as an attribute value. It cannot be embedded inside the String regex.

Advanced - Environments Define custom string to attribute key (Class) bindings env.bind("numcomptype", CoreAnnotations.NumericCompositeTypeAnnotation.class); Define custom boolean match functions env.bind(“::FUNC_NAME", new NodePattern<T>() { boolean match(T in) { … } });

Priorities and Multiple Patterns Can give a pattern priority Priorities are doubles (+ high priority, - low priority, 0 default) pattern.setPriority(1); List of Patterns to be matched Try the MultiPatternMatcher to get a list of non-overlapping matches MultiPatternMatcher<CoreMap> m = new MultiPatternMatcher<CoreMap>(patternList); List<CoreMap> matches = m.findNonOverlapping(tokens); Overlaps are resolved by pattern priority, match length, pattern order, and offset.

For More Help… There is a JUnitTest in the TokensRegex package called TokenSequenceMatcherITest that has some test patterns If you find a bug (i.e. a pattern that should work but doesn’t) or need more help, email angelx@cs.stanford.edu

Thanks!