Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation Alexander Gelbukh
2 Previous Chapter: Conclusions Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning Useful in translation, information retrieval, and text understanding Dictionary-based methods good but expensive Statistical methods cheap and sometimes imperfect... but not always (if very large corpora are available)
3 Previous Chapter: Research topics Too many to list New methods Lexical resources (dictionaries) = Computational linguistics
4 Contents Language levels Syntax Dependency approach Constituency-based approach Head-driven approach Grammars and parsing Ambiguity and disambiguation
5 Language levels Letters are built up into words Words into sentences Sentences into text Each level has its own representation This allows for modular processing A module describes one level or transforms from one level to another
6 Source of language complexity: 1-D
8 Linguistic processor translates between representations
9 General scheme of text processing Linguistic processor uses linguistic knowledge Applied system uses other types of knowledge (e.g., Artificial Intelligence)
10 Language levels Morphological: words Syntactic: sentences Semantic: meaning Pragmatic: intention...?
11 Fine structure of linguistic processor
12 Example of text Science is important for our country. The Government pays it much attention.
13 Textual representation Text is a sequence of letters. S c i e n c e i s i m p o r t a n t f o r o u r c o u n t r y. T h e G o v e r n m e n t p a y s i t m u c h a t t e n t i o n.
15 Morphological representation A sequence of words.
16 Syntactic parsing
17 Syntactic representation A sequence of syntactic trees.
18 Syntactic representation What happened? With whom did it happen?... their details
19 Semantic analysis Next lecture...
20 Syntax The structure describing the relationships between words in a sentence Describes the relationships implied by grammatical characteristics not by meaning Often allows for simple paraphrasing John reads the book The book is read by John
21 Early approach: Dependency syntax Tree Nodes: words Arcs: "modified by" "Modifies" means adds details, clarifies, chooses one of many... makes more specific Arcs are typed Types are: subject, object, attribute,... (Figure: dependency tree with arcs labeled Subject, Object, Recipient, Attribute)
22... Dependency syntax General situation: pay More specifically: the one where: who pays is government what is paid is attention to whom it is paid is it More specifically: attention that is much (Figure: dependency tree with arcs labeled Subject, Object, Recipient, Attribute)
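The dependency structure described on these two slides can be written down as a small data structure. The following is only a sketch: the `Node` class and the `arcs` helper are illustrative names that do not come from the lecture, and the article "The" is left out because the slides do not say how it attaches.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    # Each dependent is stored as (arc type, child node).
    dependents: list = field(default_factory=list)

    def add(self, arc_type, child):
        """Attach `child` as a modifier of this word and return it."""
        self.dependents.append((arc_type, child))
        return child


def arcs(node):
    """Yield (head word, arc type, dependent word) for every arc in the tree."""
    for arc_type, child in node.dependents:
        yield (node.word, arc_type, child.word)
        yield from arcs(child)


# "The Government pays it much attention."
pays = Node("pays")
pays.add("subject", Node("Government"))
attention = pays.add("object", Node("attention"))
pays.add("recipient", Node("it"))
attention.add("attribute", Node("much"))

for head, arc_type, dependent in arcs(pays):
    print(f"{head} --{arc_type}--> {dependent}")
```

Each arc reads "head is modified by dependent", so the root "pays" is the general situation and every arc makes it more specific.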
23 Advantages/disadvantages of Dependency Syntax Advantages Solid linguistic base Rather direct translation into semantics Easily applicable to languages with free word order Korean? Russian, Latin This is why solid linguistic base: good for classical languages! Disadvantages No nice mathematical base No simple algorithms
24 Most popular approach: Constituency (Phrase Structure grammars) Tree Nodes: nested segments of the phrase Cannot intersect, only nested Usually are labeled with part-of-speech names Arcs: nesting In classical approach, arcs are not labeled [ [ Our Government ] [ pays [ much attention ] [ to it ] ] ]
25 Constituency [ [ Our Government ] [ pays [ much attention ] [ to it ] ] ] Our Government pays much attention to it
26 Constituency [ [ Our R Government N ] NP [ pays V [ much A attention N ] NP [ to P it R ] PP ] VP ] S R: pronoun, N: noun, V: verb, A: adjective, P: preposition, NP: noun phrase, VP: verb phrase, PP: prepositional phrase, S: sentence
27 Constituency: graphical representation [ [ Our Government ] NP [ pays [ much attention ] NP [ to it ] PP ] VP ] S (Figure: the same structure drawn as a tree, with node labels S, NP, VP, PP above the part-of-speech tags R N V A N P R and the words Our Government pays much attention to it)
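The labeled bracket notation is itself a serialization of the tree, so it can be read back mechanically. Below is a minimal sketch of such a reader. It assumes that tokens are separated by spaces, that a label follows every closing bracket, and that a tag follows every word, as in the fully labeled example above; the function names are illustrative.

```python
def parse_brackets(text):
    """Read '[ word TAG ... ] LABEL' notation into nested (label, body) tuples."""
    tokens = text.split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] == "[":
            pos += 1                      # consume "["
            children = []
            while tokens[pos] != "]":
                children.append(parse_node())
            pos += 1                      # consume "]"
            label = tokens[pos]           # the label follows the bracket
            pos += 1
            return (label, children)
        word, tag = tokens[pos], tokens[pos + 1]
        pos += 2
        return (tag, word)                # a leaf: (part of speech, word)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


def leaves(tree):
    """Return the words of the tree from left to right."""
    _, body = tree
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


def show(tree, depth=0):
    """Print the tree with one node per line, indented by nesting depth."""
    label, body = tree
    if isinstance(body, str):
        print("  " * depth + f"{label}: {body}")
    else:
        print("  " * depth + label)
        for child in body:
            show(child, depth + 1)


SENTENCE = ("[ [ Our R Government N ] NP [ pays V [ much A attention N ] NP "
            "[ to P it R ] PP ] VP ] S")
tree = parse_brackets(SENTENCE)
show(tree)
```

Because constituents can only nest and never intersect, a single left-to-right pass with a recursive call per opening bracket is enough.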
28 Phrase structure grammar Enumerates possible configurations at nodes Usually recursive S → NP VP, NP → A NP, NP → R NP, PP → P NP, NP → N, VP → VP NP PP, VP → V (Figure: the tree for "Our Government pays much attention to it", with node labels S, NP, VP, PP over the tags R N V A N P R)
29 Context-independence hypothesis A configuration is possible or not, regardless of where it is used Wherever you find VP NP PP, it can be VP Wherever you find NP VP, it can be S If you can put together an S that covers the whole sentence, it is a grammatically correct description With this, given a suitable grammar, you can List all sentences of a language List only correct sentences of that language List all and only correct structures Correctness means a native speaker's intuition
30 Generative idea Find a grammar to list all and only correct sentences (with their structures) of a language This is a complete description of that language! How can it be useful in analysis? Reverse the grammar
31 Parsing Given a grammar and a sentence Find all possible structures That describe this sentence with this grammar Many methods. Not discussed today. A lot of research. Very fast algorithms Complexity: cubic in the number of words in the sentence (asymptotically better methods exist, with an exponent of about 2.8) Problem: combinatorics of variants
32 Advantages and disadvantages of constituency approach Advantages Nice mathematics, very well understood Efficient analysis algorithms, very well-elaborated Good for languages with fixed word order English. Chinese? Disadvantages Difficult translation into semantics Bad when it comes to freer word order Even in English! Worse in other languages
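To make "reverse the grammar" concrete, here is a small sketch of a parser that finds all structures of a sentence under a toy grammar modelled on the phrase structure grammar slide. It is a naive memoized search over word spans, not one of the fast algorithms the slide alludes to. The rules PP → P NP and NP → R are assumptions added so that "to it" can be analyzed, and the lexicon is invented for this one sentence.

```python
from functools import lru_cache

# Toy grammar; PP -> P NP and NP -> R are assumptions (see the lead-in).
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("A", "NP"), ("R", "NP"), ("N",), ("R",)],
    "PP": [("P", "NP")],
    "VP": [("VP", "NP", "PP"), ("V",)],
}
# Hypothetical lexicon: one part-of-speech tag per word.
LEXICON = {"our": "R", "government": "N", "pays": "V", "much": "A",
           "attention": "N", "to": "P", "it": "R"}


def parse(words, start="S"):
    """Return every bracketed structure of `words` rooted in `start`."""
    tags = tuple(LEXICON[w.lower()] for w in words)  # KeyError on unknown words

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        """All trees rooted in `symbol` that cover words[i:j]."""
        trees = []
        if j - i == 1 and tags[i] == symbol:
            trees.append(f"{words[i]}/{symbol}")
        for rhs in GRAMMAR.get(symbol, []):
            for children in expand(rhs, i, j):
                trees.append(f"[{' '.join(children)}]{symbol}")
        return tuple(trees)

    @lru_cache(maxsize=None)
    def expand(rhs, i, j):
        """All ways to divide words[i:j] among the symbols of `rhs`."""
        if len(rhs) == 1:
            return tuple((tree,) for tree in derive(rhs[0], i, j))
        results = []
        # Every symbol must cover at least one word.
        for k in range(i + 1, j - len(rhs) + 2):
            for first in derive(rhs[0], i, k):
                for rest in expand(rhs[1:], k, j):
                    results.append((first,) + rest)
        return tuple(results)

    return derive(start, 0, len(tags))


for tree in parse("Our Government pays much attention to it".split()):
    print(tree)
```

The rules are applied regardless of context: any span that matches a right-hand side may carry the left-hand label, and the sentence is accepted if some S covers all of it. An empty result means the grammar assigns the sentence no structure.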
33 Head-driven approaches Combine some advantages of dependency-based and constituency-based approaches Syntax is still fixed-order. But word dependency information is added Easier translation into semantics More linguistically-based How? In each constituent, the main word (head) is marked It modifies the head of the larger constituent [ [ Our Government ] [ pays [ much attention ] [ to it ] ] ]
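The way head marking recovers dependency information can be sketched in a few lines. The tree encoding below, a pair of children and the index of the head child, is an illustrative choice and not the lecture's notation. Taking the preposition as the head of "to it" is likewise an assumption.

```python
# A constituent is (children, head_index); a leaf is a plain word string.
TREE = (
    [
        (["Our", "Government"], 1),        # head: Government
        (["pays",
          (["much", "attention"], 1),      # head: attention
          (["to", "it"], 0)],              # head: to (an assumption)
         0),                               # head: pays
    ],
    1,                                     # head of the sentence: the verb phrase
)


def head_word(node):
    """Follow head marks down to the main word of a constituent."""
    if isinstance(node, str):
        return node
    children, head_index = node
    return head_word(children[head_index])


def dependencies(node):
    """Return (head, dependent) pairs: the head word of every non-head child
    modifies the head word of the constituent that contains it."""
    if isinstance(node, str):
        return []
    children, head_index = node
    head = head_word(node)
    pairs = []
    for index, child in enumerate(children):
        if index != head_index:
            pairs.append((head, head_word(child)))
        pairs.extend(dependencies(child))
    return pairs


for head, dependent in dependencies(TREE):
    print(f"{head} <- {dependent}")
```

The word order stays fixed by the constituent structure, while the extracted pairs give the word-to-word relations that make translation into semantics easier.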
34 Syntactic ambiguity I see a cat with a telescope I see [a cat] [with a telescope] I use a telescope to see a cat I see [ a cat [with a telescope] ] I see a cat that has a telescope Nearly any preposition causes ambiguity Dozens, thousands, millions of variants for a sentence! Because their numbers multiply I see a cat with a telescope in a garden at the shore of a river
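How fast the number of variants grows can be checked by counting parses instead of listing them. The sketch below assumes a toy grammar in which a prepositional phrase may attach either to a verb phrase or to a noun phrase, and treats each noun phrase such as "a cat" as a single pre-grouped token. The rule set and function names are illustrative.

```python
from functools import lru_cache

# Toy binary grammar: a PP may attach to a VP or to an NP.
RULES = {
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("NP", "PP")],
    "PP": [("P", "NP")],
}


def count_parses(tags, root="VP"):
    """Count the distinct structures of a tag sequence rooted in `root`."""
    tags = tuple(tags)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # A single token counts once if it already carries the wanted symbol.
        total = 1 if (j - i == 1 and tags[i] == symbol) else 0
        for left, right in RULES.get(symbol, []):
            for k in range(i + 1, j):
                total += count(left, i, k) * count(right, k, j)
        return total

    return count(root, 0, len(tags))


def see_with_pps(n):
    """Tags for 'see [a cat]' followed by n prepositional phrases."""
    return ["V", "NP"] + ["P", "NP"] * n


for n in range(5):
    print(n, count_parses(see_with_pps(n)))
# 0 1
# 1 2
# 2 5
# 3 14
# 4 42
```

Under these assumptions the verb phrase of the last example sentence, with its four prepositional phrases, already has 42 structures, and the count keeps multiplying with each further phrase.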
35 Ambiguity resolution Syntactic means are not enough Is telescope more related to see or to cat? Statistical methods: is it used with see or cat? Dictionary-based methods: does it share more meaning with see or cat? Path length in a dictionary of semantic relationships Ideally, context should be analyzed, and reasoning applied: I see a cat with a telescope. It keeps the telescope in its left paw. Currently there are no good methods for this.
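The statistical idea, asking whether "telescope" is used more with "see" or with "cat", can be sketched as a comparison of relative co-occurrence frequencies. All counts below are invented purely for illustration; a real system would estimate them from a large corpus, and the scoring formula is one simple choice among many.

```python
# Hypothetical corpus counts, invented purely for illustration.
PAIR_COUNTS = {
    ("see", "telescope"): 40,
    ("cat", "telescope"): 2,
    ("see", "collar"): 3,
    ("cat", "collar"): 30,
}
WORD_COUNTS = {"see": 1000, "cat": 500}


def association(head, noun):
    """Relative frequency with which `noun` co-occurs with `head`."""
    return PAIR_COUNTS.get((head, noun), 0) / WORD_COUNTS[head]


def choose_attachment(verb, object_noun, pp_noun):
    """Attach the prepositional phrase to whichever candidate head is more
    strongly associated with its noun; ties go to the verb (an arbitrary choice)."""
    verb_score = association(verb, pp_noun)
    noun_score = association(object_noun, pp_noun)
    return verb if verb_score >= noun_score else object_noun


print(choose_attachment("see", "cat", "telescope"))  # see
print(choose_attachment("see", "cat", "collar"))     # cat
```

Such a score looks only at word pairs, so it cannot use the wider context that settles the "left paw" example.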
36 Shallow parsing Due to the HUGE problems in resolving ambiguity Do not resolve it! Do what you can do well I see [a cat] [with a telescope] [in a garden] [at the shore] [of a river] Better than nothing Can be done well
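A chunker of this kind can be very small. The sketch below assumes the words are already tagged with one-letter tags and matches a pattern over the tag string: an optional preposition, any determiners or adjectives, then a noun. The tag D for determiners is an assumption added here, and pronouns are left outside chunks to mirror the bracketing on the slide.

```python
import re

# Optional preposition, then determiners/adjectives, then a noun.
CHUNK = re.compile(r"P?[DA]*N")


def chunk(tagged):
    """Bracket flat noun and prepositional chunks; attach nothing to anything."""
    tags = "".join(tag for _, tag in tagged)   # one character per word
    out, pos = [], 0
    for match in CHUNK.finditer(tags):
        out.extend(word for word, _ in tagged[pos:match.start()])
        inside = " ".join(word for word, _ in tagged[match.start():match.end()])
        out.append(f"[{inside}]")
        pos = match.end()
    out.extend(word for word, _ in tagged[pos:])
    return " ".join(out)


TAGGED = [("I", "R"), ("see", "V"), ("a", "D"), ("cat", "N"),
          ("with", "P"), ("a", "D"), ("telescope", "N"),
          ("in", "P"), ("a", "D"), ("garden", "N"),
          ("at", "P"), ("the", "D"), ("shore", "N"),
          ("of", "P"), ("a", "D"), ("river", "N")]
print(chunk(TAGGED))
```

The chunks are found without deciding what each prepositional phrase modifies, which is exactly the decision shallow parsing avoids.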
37 Evaluation PARSEVAL international contests A practical parser usually gives only one variant Implies disambiguation! Manually built corpora (treebanks) Compare what the program did with what humans did
38 One of the uses in IR: Lexical ambiguity resolution Syntactic analysis helps in POS disambiguation: Oil is used well in Mexico. Oil well is used in Mexico. Well = ? But does not help in WSD: I deposited my money in an international bank. I live on a beautiful bank of the Han River.
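Comparing what the program did with what humans did is commonly scored by matching labeled brackets. The sketch below computes bracket precision, recall, and F1 for one sentence. It is a simplified illustration: real PARSEVAL scoring follows additional conventions (for example on punctuation) that are not modelled here, and the tree encoding and example trees are assumptions.

```python
from collections import Counter


def brackets(tree, start=0):
    """Collect (label, start, end) for every constituent.
    A tree is (label, children); a child is a tree or a word string.
    Returns (list of brackets, end position)."""
    label, children = tree
    found, pos = [], start
    for child in children:
        if isinstance(child, str):
            pos += 1
        else:
            sub, pos = brackets(child, pos)
            found.extend(sub)
    found.append((label, start, pos))
    return found, pos


def parseval(gold_tree, test_tree):
    """Return (precision, recall, F1) of the test brackets against the gold ones."""
    gold = Counter(brackets(gold_tree)[0])
    test = Counter(brackets(test_tree)[0])
    matched = sum((gold & test).values())
    precision = matched / sum(test.values())
    recall = matched / sum(gold.values())
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Treebank (human) analysis: the telescope is the instrument of seeing.
GOLD = ("S", ["I", ("VP", ["see", ("NP", ["a", "cat"]),
                           ("PP", ["with", ("NP", ["a", "telescope"])])])])
# Parser output: the cat has the telescope.
TEST = ("S", ["I", ("VP", ["see", ("NP", [("NP", ["a", "cat"]),
                           ("PP", ["with", ("NP", ["a", "telescope"])])])])])

precision, recall, f1 = parseval(GOLD, TEST)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example the wrong attachment costs only one extra bracket, so bracket scores stay high even though the meaning is different.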
39 Research topics Faster algorithms E.g. parallel Handling linguistic phenomena not handled by current approaches Ambiguity resolution! Statistical methods A lot can be done
40 Conclusions Syntactic structure is one of the intermediate representations of a text for its processing Helps text understanding Thus reasoning, question answering,... Directly helps POS tagging Resolves lexical ambiguity of part of speech But not WSD-type ambiguities A big science in itself, with 50 (2000?) years of history