An Attack on Data Sparseness JHU – Tutorial June 11 2003
OVERVIEW What is this project about? What is GATE? Lab assignment
Basic Approach – (from RG talk) Build linguistic patterns: "person was appointed as post of company"; "company named person to post". Apply patterns to text and fill the database.
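A minimal sketch of how the two patterns on this slide might be applied to text. The regular expressions, the NAME helper, and the sample sentence are illustrative assumptions, not the patterns used in the RG talk; only the slot names (person, post, company) come from the slide.

```python
import re

# Hypothetical approximation of a proper name: one or more capitalised words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# The two slide patterns, written as regexes with named slots (assumed forms).
PATTERNS = [
    re.compile(rf"(?P<person>{NAME}) was appointed as (?P<post>[\w ]+?) of (?P<company>{NAME})"),
    re.compile(rf"(?P<company>{NAME}) named (?P<person>{NAME}) to (?P<post>[\w ]+)"),
]

def extract(text):
    """Return one dict of filled slots per pattern match."""
    records = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            records.append(match.groupdict())
    return records

# Made-up example text.
text = ("Jane Smith was appointed as chief executive of Acme Corp. "
        "Later, Globex named John Doe to chairman.")
records = extract(text)
```

Each record is a row that could be written to the database. Hand-written patterns like these only fire on exact wording, which is the sparseness problem the rest of the tutorial addresses.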
Getting these patterns … Use training data to gather information about the contexts of the important bits of text. Write an algorithm that automatically makes use of the contextual information to further identify new important bits and labels them.
It is a difficult task. We are already pretty good at identifying and locating: people, locations, organizations, dates, times. What if we could do more?
Would it help to tag/replace noun phrases? Astronauts aboard the space shuttle Endeavour were forced to dodge a derelict Air Force satellite Friday. HUMANS aboard SPACE_VEHICLE dodge SATELLITE TIMEREF
We could transform the training data and get more HUMANS DODGE SATELLITE After parsing: HUMANS aboard SPACE_VEHICLE dodge SATELLITE TIMEREF
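A rough sketch of the noun-phrase replacement idea, assuming the noun phrases and their head nouns have already been identified. The HEAD_CLASS lexicon and the chunk representation are hypothetical stand-ins; only the class names come from the slide.

```python
# Toy lexicon mapping head nouns to semantic classes (made up for illustration).
HEAD_CLASS = {
    "astronauts": "HUMANS",
    "endeavour": "SPACE_VEHICLE",
    "satellite": "SATELLITE",
    "friday": "TIMEREF",
}

def abstract(chunks):
    """Replace each noun phrase by the semantic class of its head noun.

    `chunks` is a list of (text, head) pairs; head is None for non-NP text.
    Noun phrases whose head is not in the lexicon are left unchanged.
    """
    out = []
    for text, head in chunks:
        out.append(HEAD_CLASS.get(head.lower(), text) if head else text)
    return " ".join(out)

chunks = [
    ("Astronauts", "Astronauts"),
    ("aboard", None),
    ("the space shuttle Endeavour", "Endeavour"),
    ("were forced to dodge", None),
    ("a derelict Air Force satellite", "satellite"),
    ("Friday", "Friday"),
]
abstracted = abstract(chunks)
```

Many different surface sentences collapse onto the same abstracted string, so each abstract pattern is seen more often in training.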
Could we know these are the same? The IRA bombed a family-owned shop in Belfast yesterday. FMLN set off a series of explosions in central Bogota today. ORGANIZATION ATTACKED LOCATION DATE
Lexicography Data Sparseness again.. Sever BODYPART Sever an arm Sever a finger Sever FASTENER Sever the bond.. Sever the links …
Machine translation Ambiguity of words often means that a word can translate several ways. Would knowing the semantic class of a word help us to know the translation?
Sometimes... Crane the bird vs crane the machine Bat the animal vs bat for cricket and baseball Seal on a letter vs the animal
SO.. P(translation(crane) = grulla | animal) > P(translation(crane) = grulla) and P(translation(crane) = grua | machine) > P(translation(crane) = grua). Can we show the overall effect lowers entropy?
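One way to check the entropy question on a toy case is to compare the entropy of the translation distribution with and without knowledge of the semantic class. The counts below are invented purely for illustration; with real data they would come from an aligned, semantically tagged corpus.

```python
from collections import Counter
from math import log2

# Made-up (semantic class, Spanish translation) counts for "crane".
counts = Counter({
    ("animal", "grulla"): 45, ("animal", "grua"): 5,
    ("machine", "grua"): 40, ("machine", "grulla"): 10,
})

def entropy(dist):
    """Shannon entropy in bits of a Counter of raw counts."""
    total = sum(dist.values())
    return -sum(c / total * log2(c / total) for c in dist.values() if c)

def marginal_entropy(counts):
    """H(translation): the semantic class is ignored."""
    by_translation = Counter()
    for (_, translation), c in counts.items():
        by_translation[translation] += c
    return entropy(by_translation)

def conditional_entropy(counts):
    """H(translation | class): per-class entropy weighted by class frequency."""
    total = sum(counts.values())
    by_class = {}
    for (cls, translation), c in counts.items():
        by_class.setdefault(cls, Counter())[translation] += c
    return sum(sum(d.values()) / total * entropy(d) for d in by_class.values())

h_plain = marginal_entropy(counts)
h_given_class = conditional_entropy(counts)
```

Conditioning can never raise entropy in expectation, so the interesting empirical question is how large the drop is when the classes are assigned automatically and therefore noisily.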
Language Modeling – Data Sparseness again.. We need to estimate Pr(w_3 | w_1 w_2). If we have never seen w_1 w_2 w_3 before, can we instead develop a model and estimate Pr(w_3 | C_1 C_2) or Pr(C_3 | C_1 C_2)?
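A small sketch of a class-based estimate for an unseen word trigram. It assumes the common factorization Pr(w_3 | w_1 w_2) ≈ Pr(C_3 | C_1 C_2) · Pr(w_3 | C_3), which goes one step beyond what the slide states; the WORD_CLASS map and the three training sentences are invented.

```python
from collections import Counter

# Hypothetical word-to-class map; words not listed act as their own class.
WORD_CLASS = {"arm": "BODYPART", "finger": "BODYPART", "bond": "FASTENER", "links": "FASTENER"}

def cls(word):
    return WORD_CLASS.get(word, word)

def train(sentences):
    """Collect class trigram, class bigram-history, and word-given-class counts."""
    tri, bi, emit, cls_total = Counter(), Counter(), Counter(), Counter()
    for sent in sentences:
        for w in sent:
            emit[(cls(w), w)] += 1
            cls_total[cls(w)] += 1
        for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
            tri[(cls(w1), cls(w2), cls(w3))] += 1
            bi[(cls(w1), cls(w2))] += 1
    return tri, bi, emit, cls_total

def class_prob(model, w1, w2, w3):
    """Estimate Pr(w3 | w1 w2) as Pr(C3 | C1 C2) * Pr(w3 | C3), unsmoothed."""
    tri, bi, emit, cls_total = model
    c1, c2, c3 = cls(w1), cls(w2), cls(w3)
    if bi[(c1, c2)] == 0 or cls_total[c3] == 0:
        return 0.0
    return tri[(c1, c2, c3)] / bi[(c1, c2)] * emit[(c3, w3)] / cls_total[c3]

model = train([
    ["sever", "the", "arm"],
    ["sever", "the", "bond"],
    ["cut", "the", "finger"],
])
# "sever the finger" never occurs as a word trigram in the training data.
p = class_prob(model, "sever", "the", "finger")
```

The word trigram "sever the finger" has a zero count, yet the class model still gives it probability mass because "sever the BODYPART" was observed.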
Overview Noun Phrases Identified Head Nouns Identified People marked Locations, dates, currencies, organizations Also marked CORPUS
Overview Human Annotated with semantic tags– Noun Phrases Only
Overview Test portion Training portion Machine Learning to improve this
The Environment GATE – an environment which conforms to the TIPSTER architecture Provides many tools for processing language and a standard method for managing documents and any new information associated with the document
GATE - Documents have annotations ~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~GEORGE BUSH~~~~ ~~~~~~~~~~~~~~~~~~~~~~ GEORGE BUSH at offset 104-114 is a person
There may be more than one annotation ~~~~~~~~~~~~~~~~~~~~~~ ~~~~The ruthless criminal~~~~ ~~~~~~~~~~~~~~~~~~~~~~ criminal at offset 104-122 is a human, is a noun, is the head of a noun phrase
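The annotation idea can be sketched as stand-off records that point into the text by character offset. This is a simplified illustration in Python, not GATE's actual API; the class names, the field names, and the exclusive end offset are all assumptions of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int                      # offset of the first character
    end: int                        # offset one past the last character (assumed convention)
    kind: str                       # e.g. "Person", "NounPhrase"
    features: dict = field(default_factory=dict)

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, kind, **features):
        self.annotations.append(Annotation(start, end, kind, features))

    def covering(self, start, end):
        """All annotations whose span contains [start, end)."""
        return [a for a in self.annotations if a.start <= start and end <= a.end]

doc = Document("The ruthless criminal escaped.")
doc.annotate(0, 21, "NounPhrase")
doc.annotate(13, 21, "Token", category="noun", head=True)
doc.annotate(13, 21, "Semantic", tag="human")
kinds = [a.kind for a in doc.covering(13, 21)]
```

Because annotations live beside the text rather than inside it, several tools can label the same span without disturbing one another.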
Documents belong to collections (a corpus in GATE) Collections can be loaded into GATE New collections can be created Documents can be added or removed Applications can run over whole collections
Applications – processing resources Programs (tools) can be loaded into GATE. An application is formed as a pipeline of some tools. In the demo, you will see two applications.
ANNIE – with defaults: Sentence Splitter, POS tagger, NE recognizer, Tokenizer, plus more
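The notion of an application as a pipeline of processing resources run over a corpus can be sketched generically. The two toy resources below are written from scratch for illustration and are far cruder than the real ANNIE components; the dict-based document and the function names are assumptions of this sketch.

```python
import re

def tokenizer(doc):
    # Split into word tokens and single punctuation marks.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc

def sentence_splitter(doc):
    # Close a sentence at every ".", "!" or "?" token; needs tokens first.
    sentences, current = [], []
    for tok in doc["tokens"]:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences
    return doc

def run(pipeline, corpus):
    """Apply every processing resource, in order, to every document."""
    for doc in corpus:
        for resource in pipeline:
            resource(doc)
    return corpus

corpus = [{"text": "Bats fly at night. The pitcher held the bat."}]
run([tokenizer, sentence_splitter], corpus)
```

Order matters: the sentence splitter here reads the tokens that the tokenizer wrote, which is why both an application and a corpus are needed before anything can be processed.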
Using GATE in today’s lab: to view already processed documents; to process new documents. To process documents, you must have both an application and a corpus.
To learn more.. http://www.gate.ac.uk Tutorials, slides, downloadable versions for PC, Linux, Solaris, etc.
The lab Follow the directions in /export/ws03sem/lab/gate.lab. Use the internet or Grolier to find paragraphs or documents about bats that fly and bats that hit a ball (cricket bat or baseball bat).
Which bat is it? Use the web texts as training data for the context – you can load them into GATE or use them as is. Try a bag of words approach.
The idea Texts about flying bats Texts about movable solid ones The pitcher held the bat firmly NEW
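A minimal bag-of-words sketch for the bat exercise. It uses a naive Bayes score with add-one smoothing, which is one possible reading of "bag of words" and not necessarily what the lab directions prescribe; the four training sentences and the sense labels are invented stand-ins for the web texts.

```python
from collections import Counter
from math import log
import re

def bag(text):
    """Lower-cased word counts, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train(labelled_texts):
    """labelled_texts maps a sense label to a list of training texts."""
    return {label: sum((bag(t) for t in texts), Counter())
            for label, texts in labelled_texts.items()}

def classify(models, text):
    vocab = set().union(*models.values())
    def score(label):
        counts = models[label]
        total = sum(counts.values())
        # Add-one smoothing so unseen words do not zero out the score.
        return sum(n * log((counts[w] + 1) / (total + len(vocab)))
                   for w, n in bag(text).items())
    return max(models, key=score)

models = train({
    "animal": ["Bats fly at night and roost in caves.",
               "The bat is a flying mammal."],
    "equipment": ["The batter swung the bat and hit the ball.",
                  "The pitcher threw the ball past the bat."],
})
sense = classify(models, "The pitcher held the bat firmly")
```

With more training text, stemming and a stop-word list would likely matter more than the choice of scoring function.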
Resources: Porter Stemmer, GATE. Can collect trigrams or bigrams from the training data.
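Collecting bigrams or trigrams needs only a few lines. The sketch below works on whitespace tokens and skips stemming; the sample sentence is taken from the earlier slide.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the pitcher held the bat firmly".split()
bigrams = Counter(ngrams(tokens, 2))
trigrams = Counter(ngrams(tokens, 3))
```

Counts gathered this way over each set of training texts could replace or supplement the single-word counts in a bag-of-words model.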
Comments A very primitive approach to the problem Use your work to say which kind of ‘bat’ is used in the text bat.txt Try your same technique for ‘seal’ There is a file called seal.txt to test on
Finally If you are very brave, can you find the semantic classes for ‘chicken’ in the chicken.txt file? Careful – this one has a lot of metaphorical use. Have fun!
Tag Set Longman’s Dictionary (LDOCE): 2000-word defining vocabulary; 34 semantic categories over subject codes; over 5000 combination markings. Gives us 85% coverage of NPs but only contains 35% of the vocabulary.
WordNet Developed at Princeton (George Miller). About the same coverage on a sample. Defines synsets instead of senses. Arranged with ‘IS A’ relations which can serve as a semantic category. The English acts as an interlingua to EuroWordNet.
Corpus BNC – 100 million words – mostly written (about 10% spoken), POS tagged with CLAWS. English side of parallel texts, possibly 80 million words, aligned: some French, some Chinese, some Arabic. Or possibly UN data supplied by the MT team.
Evaluation This must be decided before July Baselines should be presented for the opening talk The closing talk should include baseline plus as many measures of improvement as we can come up with
Closing presentation One half day for each of the three projects Each person should plan to talk One part of the team should be devoted to this aspect of the project
Evaluation – suggested focus We focus on showing that we can improve the entropy for MT.
Techniques Basically two possibilities: (1) Extend techniques from disambiguation for assigning semantic category and then subject area (word focused). (2) Use machine learning to learn about the contexts and features of a particular semantic category, then tag those (semantic category focused).
Today 12-1 Roberto and Fabio: Machine learning, WordNet and conceptual density, LDOCE – WordNet correspondence. 1-2 Lunch. 2-3:30 Tagging texts and discussion. 3:30-5:30 GATE Tutorial.
Tomorrow Annotation tool Division of labor Plan Rome meeting End at 1:00
Why do it? Text Extraction Lexicography Summarization Machine Translation Language Modeling