The Wonderful World of Tregex

Slides:



Advertisements
Similar presentations
CST8177 awk. The awk program is not named after the sea-bird (that's auk), nor is it a cry from a parrot (awwwk!). It's the initials of the authors, Aho,
Advertisements

XSL eXtensible Stylesheet Language. What is XSL? XSL is a language that allows one to describe a browser how to process an XML file. XSL can convert an.
Structural Joins: A Primitive for Efficient XML Query Pattern Matching Al Khalifa et al., ICDE 2002.
Regular Expressions, Backus-Naur Form and Reverse Polish Notation.
Syntax. Definition: a set of rules that govern how words are combined to form longer strings of meaning meaning like sentences.
Resources: Question Classification Schemes, Graesser et al. Automatic Factual Question Generation from Text (Chapter 3), Michael Heilman.
Grammars, constituency and order A grammar describes the legal strings of a language in terms of constituency and order. For example, a grammar for a fragment.
การใช้ระบบปฏิบัติการ UNIX พื้นฐาน บทที่ 4 File Manipulation วิบูลย์ วราสิทธิชัย นักวิชาการคอมพิวเตอร์ ศูนย์คอมพิวเตอร์ ม. สงขลานครินทร์ เวอร์ชั่น 1 วันที่
How to perform tree surgery Anna Rafferty Marie-Catherine de Marneffe.
Using Treebanks tgrep2 Lecture 2: 07/12/2011. Using Corpora For discovery For evaluation of theories For identifying tendencies – distribution of a class.
CS 898N – Advanced World Wide Web Technologies Lecture 8: PERL Chin-Chih Chang
CS 206 Introduction to Computer Science II 02 / 20 / 2009 Instructor: Michael Eckmann.
An Introduction to Using Semgrex Chloé Kiddon. What is Semgrex? A java utility (in javanlp) for identifying patterns in Stanford JavaNLP SemanticGraph.
MIRC Matthew Forest. Introduction mIRC itself is a program designed for text based messaging via the IRC (internet relay chat) protocol. (Link:
1 More Xkwic and Tgrep LING 5200 Computational Corpus Linguistics Martha Palmer March 2, 2006.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Guide To UNIX Using Linux Third Edition
Introduction to Unix (CA263) Introduction to Shell Script Programming By Tariq Ibn Aziz.
Microsoft Access 2010 Chapter 7 Using SQL.
© 2010 IBM Corporation IBM Experience Modeler - Theme Editor Installing Python Image Library Presenter’s Name - Presenter’s Title DD Month Year.
CSC 8310 Programming Languages Meeting 2 September 2/3, 2014.
An Introduction to TokensRegex
MCB Lecture #3 Sept 2/14 Intro to UNIX terminal.
1 Microsoft Access 2002 Tutorial 5 – Enhancing a Table’s Design, and Creating Advanced Queries and Custom Forms.
8 Copyright © 2004, Oracle. All rights reserved. Creating LOVs and Editors.
ASP.NET Programming with C# and SQL Server First Edition
XP New Perspectives on Microsoft Access 2002 Tutorial 51 Microsoft Access 2002 Tutorial 5 – Enhancing a Table’s Design, and Creating Advanced Queries and.
Navigating XML. Overview  Xpath is a non-xml syntax to be used with XSLT and Xpointer. Its purpose according to the W3.org is  to address parts of an.
Introduction to Shell Script Programming
XP 1 CREATING AN XML DOCUMENT. XP 2 INTRODUCING XML XML stands for Extensible Markup Language. A markup language specifies the structure and content of.
Perl Tutorial Presented by Pradeepsunder. Why PERL ???  Practical extraction and report language  Similar to shell script but lot easier and more powerful.
#N14 Pattern Value (aka Substring attribute) SDD 1.1 Proposal.
Dedan Githae, BecA-ILRI Hub Introduction to Linux / UNIX OS MARI eBioKit Workshop; Nov , 2014.
Linux+ Guide to Linux Certification Chapter Four Exploring Linux Filesystems.
XP New Perspectives on Integrating Microsoft Office XP Tutorial 2 1 Integrating Microsoft Office XP Tutorial 2 – Integrating Word, Excel, and Access.
ECA 228 Internet/Intranet Design I XSLT Example. ECA 228 Internet/Intranet Design I 2 CSS Limitations cannot modify content cannot insert additional text.
XML 2nd EDITION Tutorial 4 Working With Schemas. XP Schemas A schema is an XML document that defines the content and structure of one or more XML documents.
Chapter 1 – Matlab Overview EGR1302. Desktop Command window Current Directory window Command History window Tabs to toggle between Current Directory &
CSA2050 Introduction to Computational Linguistics Parsing I.
Perl Tutorial. Why PERL ??? Practical extraction and report language Similar to shell script but lot easier and more powerful Easy availablity All details.
Introduction to Programming Using C An Introduction to Operating Systems.
®® Microsoft Windows 7 for Power Users Tutorial 3 Managing Folders and Files.
CSCI 330 UNIX and Network Programming Unit IV Shell, Part 2.
Graphs & Trees Intro2CS – week General Graphs Nodes (Objects) can have more than one reference. Example: think of implementing a Class “Person”
Agenda The Linux File System (chapter 4 in text)
1 P51UST: Unix and Software Tools Unix and Software Tools (P51UST) Awk Programming Ruibin Bai (Room AB326) Division of Computer Science The University.
DAY 18: MICROSOFT ACCESS – CHAPTER 3 CONTD. Akhila Kondai October 21, 2013.
Database: SQL, MySQL, LINQ and Java DB © by Pearson Education, Inc. All Rights Reserved.
XP New Perspectives on Microsoft Office Access 2003, Second Edition- Tutorial 5 1 Microsoft Office Access 2003 Tutorial 5 – Enhancing a Table’s Design.
Linux+ Guide to Linux Certification, Second Edition
CHAPTER 7 LESSON C Creating Database Reports. Lesson C Objectives  Display image data in a report  Manually create queries and data links  Create summary.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
Tarik Booker CS 122. What we will cover… Tables (review) SELECT statement DISTINCT, Calculated Columns FROM Single tables (for now…) WHERE Date clauses,
Agenda The Linux File System (chapter 4 in text) Directory Structures / Navigation Terminology File Naming Rules Relative vs Absolute pathnames Unix Commands:
Product Training Program
Some other query issues:
Regular Expressions, Backus-Naur Form and Reverse Polish Notation
Commands Basic syntax of shell commands UNIX or shell commands have a basic structure command -options target command comes first (such as cd or ls) any.
Looking for Patterns - Finding them with Regular Expressions
CO4301 – Advanced Games Development Week 2 Introduction to Parsing
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
LING/C SC/PSYC 438/538 Lecture 21 Sandiway Fong.
LING/C SC 581: Advanced Computational Linguistics
LING/C SC 581: Advanced Computational Linguistics
Microsoft Office Access 2003
Lecture 7: Introduction to Parsing (Syntax Analysis)
Nagendra Vemulapalli Access chapters 3&5 Nagendra Vemulapalli 1/18/2019.
Regular Expressions and Grep
LING/C SC/PSYC 438/538 Lecture 3 Sandiway Fong.
LING/C SC 581: Advanced Computational Linguistics
Presentation transcript:

The Wonderful World of Tregex

tregex.sh “NP < NN” treeFilename What is Tregex? A java program for identifying patterns in trees Like regular expressions for strings, based on tgrep syntax Simple example: NP < NN filters cigarette its in croco-dilite using stopped firm The PRP IN PP VBG VP VBD DT S tregex.sh “NP < NN” treeFilename NN NP NN NP NN NP NN NP NP NN NP NN NN NP NN NN

Syntax (Node Descriptions) The basic units of Tregex are Node Descriptions Descriptions match node labels of a tree Literal string to match: NP Disjunction of literal strings separated by ‘|’: NP|PP|VP Regular Expression (Java 5 regex): /NN.?/ Matches NN, NNP, NNS Wildcard symbol: __ (two underscores) Matches any node (warning: can be SLOW!) Descriptions can be negated with !: !NP Preceding desc with @ uses basic category @NP will match NP-SBJ

Syntax (Relations) Relationships between tree nodes can be specified There are many different relations. Here are a few: Symbol Description A < B A is the parent of B A << B A is an ancestor of B A $ B A and B are sisters A $+ B B is next sister of A A <i B B is ith child of A A <: B B is only child of A A <<# B A on head path of B A <<- B B is rightmost descendent A .. B A precedes B in depth-first traversal of tree A <+(C) B A dominates B via unbroken chain of Cs

Building complex expressions Relations can be strung together for “and” All relations are relative to first node in string NP < NN $ VP “An NP over an NN and w/ sister VP” & symbol is optional: NP < NN & $ VP Nodes can be grouped w/ parentheses NP < (NN < dog) “An NP over an NN that is over ‘dog’ ” Not the same as NP < NN < dog Ex: NP < (NN < dog) $ (VP <<# (barks > VBZ)) “An NP both over an NN over ‘dog’ and with a sister VP headed by ‘barks’ under VBZ”

Other Operators on Relations Operators can be combined via “or” with | Ex: NP < NN | < NNS “An NP over NN or over NNS” By default, & takes precedence over | Ex: NP < NNS | < NN & $ VP “NP over NNS OR both over NN and w/ sister VP” Equivalent operators are left-associative Any relation can be negated with “!” prefix Ex: NP !<< NNP “An NP that does not dominate NNP”

Grouping relations To specify operation order, use [ and ] Ex: NP [ < NNS | < NN ] $ VP “An NP either over NNS or NN, and w/ sister VP” Grouped relations can be negated Just put ! before the [ Already we can build very complex expressions! NP <- /NN.?/ > (PP <<# (IN ![ < of | < on])) “An NP with rightmost child matching /NN.?/ under a PP headed by some preposition (IN) that is not either ‘of’ or ‘on’ ”

Named Nodes Sometimes we want to find which nodes matched particular sub-expressions Ex: /NN.?/ $- @JJ|DT What was the modifier that preceded the noun? Name nodes with = and if expression matches, we can retrieve matching sub-expr with name Ex: /NN.?/ $- @JJ|DT=premod Subtree with root matching @JJ|DT is stored in a map under key “premod” Note: named nodes are not allowed in scope of negation

Optional Nodes Sometimes we want to try to match a sub-expression to retrieve named nodes if they exist, but still match root if sub-expression fails. Use the optional relation prefix ‘?’ Ex: NP < (NN ?$- JJ=premod) $+ CC $++ NP Matches NP over NN with sisters CC and NP If NN is preceded by JJ, we can retrieve the JJ using the key “premod” If there is no JJ, the expression will still match Cannot be combined with negation

Use of the tregex GUI application Double-click the Stanford Tregex application (Mac OS X) or run-tregex-gui (Windows/Linux) Equivalent to running: java -mx300m –cp stanford-tregex.jar edu.stanford.nlp.trees.tregex.gui.TregexGUI Set preferences if necessary (e.g., for non-English treebanks) Load trees from File menu Enter a search pattern in the Pattern box Click Search Also try the useful Help button

Use of tregex from the command-line tregex.sh “pattern” filename tregex.bat “pattern” filename Equivalent to: java -cp 'stanford-tregex.jar:' edu.stanford.nlp.trees.tregex.TregexPattern “pattern” filename The pattern almost always needs to be quoted because of special characters it contains like < and > If the filename is a directory, all files under it are searched

Command-line options Place any of these before the pattern: -C only count matches, don’t print -w print whole matching tree, not just matching subtree -f print filename -i <filename> read search pattern from <filename> rather than the command line -s print each match on one line, instead of multi-line pretty- printing -u only print labels of matching nodes, not complete subtrees !! -t print terminals only

Use of Tregex Java classes Tregex usage is like java.util.regex String s = “@NP $+ (CC=conj $+ (@NP <- /^PP/))”; TregexPattern p = TregexPattern.compile(s); TregexMatcher m = p.matcher(tree); while (m.find()) { m.getMatch().pennPrint(); } Named nodes are retrieved with getNode() while (m.find()) { Tree conjSubTree = m.getNode(“conj”); System.out.println(conjSubTree.value()); }

Options TregexPatterns use a HeadFinder for <<# and BasicCategory map for @ BasicCategory map is Function from String  String Defaults are for English Penn Treebank To change these, use TregexPatternCompiler HeadFinder hf = new ChineseHeadFinder(); TreebankLanguagePack chineseTLP = new ChineseTreebankLanguagePack(); Function bcf = chineseTLP.getBasicCategoryFunction(); TregexPatternCompiler c = new TregexPatternCompiler(hf, bcf); TregexPattern p = c.compile(s);

Tregex (and Tsurgeon) Available for download at: http://nlp.stanford.edu/software/tregex.shtml Tregex and Tsurgeon were initially written by Galen Andrew and Roger Levy Roger Levy and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structures. Proceedings of LREC 2006. http://nlp.stanford.edu/pubs/levy_andrew_lrec2006.pdf Formats handled by the tool: Penn Treebank Others if you provide Java TreeReader’s E.g., it has been used with CCG

The Wonderful World of Tregex ENJOY!!! The Wonderful World of Tregex