Artificial intelligence & natural language processing Mark Sanderson Porto, 2000.

Aims To provide an outline of the attempts made at using NLP techniques in IR

Objectives At the end of this lecture you will be able to –Outline a range of attempts to get NLP to work with IR systems –Idly speculate on why they failed –Describe the successful use of NLP in a limited domain

Why? NLP seems an obvious area of investigation for IR –So why is it not working?

Use of NLP Syntactic –Parsing to identify phrases –Full syntactic structure comparison Semantic –Building an understanding of a document’s content Discourse –Exploiting document structure?

Syntactic Parsing to identify phrases –The issues. –Explain how it’s done (a bit). –Is it worth it? Other possibilities –Grammatical tagging –Full syntactic structure comparison Explain how it’s done (a little bit). Show results.

Simple phrase identification High frequency terms could be good candidates. –Why? Terms co-occurring more often than chance. –Within small number of words. –Surrounding simple terms. –Not surrounding punctuation.
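The "co-occurring more often than chance, within a small number of words" idea can be sketched roughly as below. This is only an illustration, not the method of any system cited in this lecture: the function name, the window size, the count threshold and the choice of pointwise mutual information as the "more than chance" measure are all assumptions. Input is assumed to be already split at punctuation, so no pair spans a punctuation mark.

```python
import math
from collections import Counter


def candidate_phrases(segments, window=3, min_count=2):
    """Rank word pairs that co-occur within `window` words more often than chance.

    `segments` is a list of token lists, assumed to be already split at
    punctuation. All parameter names and defaults are illustrative.
    """
    word_counts = Counter()
    pair_counts = Counter()
    total = 0
    for tokens in segments:
        word_counts.update(tokens)
        total += len(tokens)
        for i, left in enumerate(tokens):
            # Pair each word with the next `window` words in the same segment.
            for right in tokens[i + 1 : i + 1 + window]:
                pair_counts[(left, right)] += 1

    scored = {}
    for (left, right), observed in pair_counts.items():
        if observed < min_count:
            continue
        # Count expected if the two words were independent of each other.
        expected = word_counts[left] * word_counts[right] / total
        # Pointwise mutual information: positive means "more often than chance".
        scored[(left, right)] = math.log2(observed / expected)
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

A pair with a high score is only a candidate phrase; the next slide shows why such candidates still need checking.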

Problems Close words that aren’t phrases. “the use of computers in science & technology” Distant words that are phrases. “preparation & evaluation of abstracts and extracts”

Parsing for phrases Using parsers to identify noun phrases. Make a phrase out of a head and the head of its modifiers. “automatic analysis of scientific text” ADJ NOUN PREP NP PP
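A minimal sketch of the head-and-modifier rule, assuming a parser has already produced the structure. The `Phrase` class and the hand-built tree are hypothetical, and treating the noun "text" as the head of the prepositional phrase "of scientific text" is an assumption made for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Phrase:
    """A parsed constituent: its head word plus the constituents modifying it."""
    head: str
    modifiers: List["Phrase"] = field(default_factory=list)


def head_modifier_pairs(phrase: Phrase) -> List[Tuple[str, str]]:
    """Pair the head of each modifier with the head it modifies, recursively."""
    pairs = []
    for modifier in phrase.modifiers:
        pairs.append((modifier.head, phrase.head))
        pairs.extend(head_modifier_pairs(modifier))
    return pairs


# Hand-built parse of "automatic analysis of scientific text".
tree = Phrase("analysis", [
    Phrase("automatic"),
    Phrase("text", [Phrase("scientific")]),
])
print(head_modifier_pairs(tree))
```

Each pair is an indexing phrase, so "analysis of scientific text" and "text analysis" would both yield the pair (text, analysis).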

Errors Not a perfect rule by any means. –Need restrictions to eliminate bogus phrases. “automatic analysis of these four scientific texts” ADJ NOUN PREP NP PP DETQUANT
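One possible restriction, sketched under assumptions: drop any pair whose modifier carries a tag that rarely makes a useful phrase. The tag names and the blocked set are hypothetical, not taken from any of the cited systems.

```python
# Hypothetical set of modifier tags that should never form an indexing phrase.
BLOCKED_TAGS = {"DET", "QUANT", "PREP"}


def filter_pairs(tagged_pairs):
    """Keep (modifier, head) pairs whose modifier tag is not blocked.

    `tagged_pairs` is a list of (modifier, modifier_tag, head) tuples.
    """
    return [
        (modifier, head)
        for modifier, tag, head in tagged_pairs
        if tag not in BLOCKED_TAGS
    ]
```

For "automatic analysis of these four scientific texts" this would discard "these texts" and "four texts" while keeping "scientific texts".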

Do they work? Fagan compared statistical with syntactic phrases; statistics won, but only just –J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, in TR 87-868 - Department of Computer Science, Cornell University More research has been conducted. –T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417

Check out TREC Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology) –http://trec.nist.gov/ –Ad hoc track Fairly even between statistical phrases, syntactic phrases and no phrases.

Grammatical tagging? Tag document text with grammatical codes? –R. Garside (1987). The CLAWS word tagging system, in The computational analysis of English: a corpus based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41. Doesn’t appear to work –R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference: 179-191.

Syntactic structure comparison Has been tried… –A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ’91, Pages 414-429 Method –Parse sentences into tree structures –When you get a phrase match Look at linking syntactic operator. Look at the residual tree structure that didn’t match Does not appear to work

Semantic Disambiguation –Given a word appearing in a certain context, disambiguators will tell you what sense it is. IR system –Index document collections by senses rather than words –Ask the users what senses the query words are –Retrieve on senses
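A toy sketch of indexing by senses rather than words. Everything here is assumed for illustration: the two-sense inventory, the cue words, and the overlap-counting disambiguator stand in for a real disambiguator, which would be far more involved.

```python
from collections import defaultdict

# Hypothetical sense inventory: each sense of a word lists a few cue words.
SENSES = {
    "bank": {
        "bank/finance": {"money", "loan", "account"},
        "bank/river": {"river", "water", "fishing"},
    }
}


def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the context words.

    Words with no entry in the inventory are returned unchanged.
    """
    senses = SENSES.get(word)
    if not senses:
        return word
    context_words = set(context)
    return max(senses, key=lambda sense: len(senses[sense] & context_words))


def build_sense_index(docs):
    """Map each sense (or plain word) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for token in tokens:
            index[disambiguate(token, tokens)].add(doc_id)
    return index
```

A query for the financial sense of "bank" would then look up `bank/finance` and skip documents about river banks, provided the disambiguator was right at indexing time.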

Disambiguation Does it work? –No (well maybe) M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994 M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.

Partial conclusions NLP has yet to prove itself in IR –Agree –D.D. Lewis & K. Sparck Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM), Vol. 39, No. 1, 92-101 –Sort of don’t agree –A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.

Mark’s idle speculation What people think is going on always Keywords NLP

Mark’s idle speculation What’s usually actually going on Keywords NLP

Areas where NLP does work Systems with the following ingredients. –Collection documents cover small domain. –Language use is limited in some manner. –User queries cover tight subject area. –Documents/queries very short Image captions –LSI, pseudo-relevance feedback –People willing to spend money getting NLP to work

RIME & IOTA From Grenoble –Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43 Medical record retrieval system Some database’y parts Free text descriptions of cases

Indexing “an opacity affecting probably the lung and the trachea” {[p], SGN} {[bears-on], SGN} {[and], SGN} {[bears-on], SGN} {[lung], LOC}{[opacity], SGN} {[trachea], LOC} LOC - localisation SGN - observed sign

Retrieval How do we match a user’s query to these structures? –Using transformations - bit like logic. {[bears-on], SGN} {[lung], LOC}{[opacity], SGN} t - uncertainty {[lung], LOC}, t {[opacity], SGN}, t  

Tree transformation {[bears-on], SGN} {[has-for-value], SGN} {[lung], LOC}{[opacity], SGN}{[contour], SGN} {[blurred], LOC} {[opacity], SGN} {[has-for-value], SGN}, t {[has-for-value], SGN} {[contour], SGN} {[blurred], LOC} 

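Term transforms Basic medical terms stored in a hierarchy. –Transformations possible again with uncertainty added.
Table columns: Level 1, Level 2, Level 3. Rows as extracted (column positions of the shorter rows are not recoverable):
tumour, cancer, sarcoma
hygroma
kyste, polykystosis
pseudokyst
polyp, polyposis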
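A rough sketch of matching through a term hierarchy with uncertainty added at each step. It reads the first row of the slide's table as the chain tumour → cancer → sarcoma; the penalty value, the function name, and the choice to let a general query term match more specific document terms are assumptions, not details from the RIME paper.

```python
# Child -> parent links, assuming the chain tumour > cancer > sarcoma.
PARENT = {"cancer": "tumour", "sarcoma": "cancer"}


def match_certainty(query_term, doc_term, penalty=0.8):
    """Certainty that doc_term satisfies query_term.

    Returns 1.0 for an exact match, a value reduced by `penalty` for every
    level climbed from doc_term up to query_term, and 0.0 if query_term is
    not doc_term or one of its ancestors. The penalty is illustrative.
    """
    certainty = 1.0
    term = doc_term
    while term is not None:
        if term == query_term:
            return certainty
        term = PARENT.get(term)  # Climb one level; None at the top.
        certainty *= penalty
    return 0.0
```

Under these assumptions a query for "tumour" matches a document mentioning "sarcoma" with certainty 0.8 × 0.8, while a query for "sarcoma" does not match a document that only says "tumour".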

Isn’t this a bit slow? Yes Optimisation –Scan for potential documents. –Process them intensively. Evaluation? –Not in that paper.
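The scan-then-process optimisation can be sketched as a two-stage search. This is a generic illustration, not the RIME implementation: the function names are hypothetical, and `deep_score` stands in for whatever expensive structural matching the system performs.

```python
def two_stage_search(query_terms, docs, deep_score, limit=10):
    """Cheap keyword scan first, expensive scoring only on the survivors.

    `docs` maps document ids to text; `deep_score(query_terms, text)` is a
    caller-supplied (hypothetical) expensive scoring function.
    """
    query = set(query_terms)

    # Stage 1: scan for potential documents sharing at least one query term.
    candidates = [
        doc_id
        for doc_id, text in docs.items()
        if query & set(text.lower().split())
    ]

    # Stage 2: process only those candidates intensively.
    scored = [(deep_score(query_terms, docs[doc_id]), doc_id) for doc_id in candidates]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:limit] if score > 0]
```

The saving comes from never calling `deep_score` on documents that fail the cheap scan.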

Not unique SCISOR –P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97

Why do they work? Because of the restrictions –Small subject domain. –Limited vocabulary. –Restricted type of question. Compare with large scale IR system. –Keywords are good enough. –Long time to set up. –Hard to adapt to new domain.

Anything else for NLP? Text Generation –IR system explaining itself?

Conclusions By now, you will be able to –Outline a range of attempts to get NLP to work with IR systems –Idly speculate on why they failed –Describe the successful use of NLP in a limited domain
