Information Extraction and Automatic Summarisation *

1 Information Extraction and Automatic Summarisation *

2 How IE fits in with IR
• IR selects a few relevant documents from many.
• IE starts with one or a few relevant documents.
• IE pulls out the words and phrases most central to the meaning of that document (or those documents) to produce an extract.

3 Two processes associated with information extraction
• Determination of facts to go into structured fields in a database.
• Extraction of text that can be used to summarise an item.
• In the first case, only a subset of the important facts in an item may be identified and extracted. The term "slot" is used to define a particular category of information to be extracted. Slots are organised into semantic frames.

4 What do we most want to know from a journal article about agriculture?
• AGENT: chemical agent applied
• CV: cultivar (e.g. King Edward)
• HLP: high level property (e.g. yield)
• INF: influence (e.g. drought)
• LAB: site of test (e.g. laboratory)
• LLP: low level property (e.g. root mass)
• LOC: location
• PEST: pest or disease
• SOIL: soil
• SPEC: crop species (e.g. potato)
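As a minimal sketch of how these slots could be organised into a semantic frame, the Python below holds one attribute per slot. The class name `AgricultureFrame` and the helper `filled_slots` are hypothetical names chosen for illustration; the slides do not prescribe any data structure.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AgricultureFrame:
    """One semantic frame; each attribute is a slot from the slide."""
    AGENT: Optional[str] = None  # chemical agent applied
    CV: Optional[str] = None     # cultivar
    HLP: Optional[str] = None    # high level property
    INF: Optional[str] = None    # influence
    LAB: Optional[str] = None    # site of test
    LLP: Optional[str] = None    # low level property
    LOC: Optional[str] = None    # location
    PEST: Optional[str] = None   # pest or disease
    SOIL: Optional[str] = None   # soil
    SPEC: Optional[str] = None   # crop species

    def filled_slots(self) -> dict:
        """Return only the slots for which a value was extracted."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


# Only a subset of the facts in an article may be found, so most slots stay empty.
frame = AgricultureFrame(SPEC="potato", CV="King Edward", HLP="yield", INF="drought")
print(frame.filled_slots())
```

Leaving unfilled slots as `None` reflects the point made on slide 3 that only a subset of the important facts may be identified.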

5 Automatic Abstracting
• In the second case, rather than trying to determine specific facts, the goal of document summarisation is to extract a summary of an item, maintaining the most important ideas while significantly reducing its size.
• For journal articles, this is called automatic abstracting.
• The abstract is a way for the user to determine the utility of an article without having to read the whole item.

6 Kupiec's heuristics
• Sentence length feature that requires the sentence to be over five words in length.
• Fixed phrase feature that looks for the existence of "phrase" cues, e.g. "in conclusion…".
• Paragraph feature that places emphasis on the first ten and the last five paragraphs in an item and also the location of the sentences within the paragraph.
• Thematic word feature that uses word frequency.
• Uppercase word feature that places emphasis on proper names and acronyms.
• Kupiec discovered that location-based heuristics give better results than the frequency-based features.
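The five features above can be sketched as simple boolean tests on a sentence. This is an illustrative simplification, not the method from the original work: the function name `sentence_features`, the second cue phrase, and the way thematic words are passed in are all assumptions, and the slide does not say how the features are combined into a final decision. The within-paragraph location of the sentence is left out.

```python
import re

# "in conclusion" comes from the slide; "in summary" is an assumed extra cue.
CUE_PHRASES = ("in conclusion", "in summary")


def sentence_features(sentence, para_index, n_paragraphs, thematic_words):
    """Compute one boolean per heuristic for a single sentence.

    para_index is the 0-based index of the paragraph containing the sentence;
    thematic_words is a set of lowercase high-frequency words chosen by the caller.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    lower = sentence.lower()
    return {
        # Sentence length: over five words.
        "length": len(words) > 5,
        # Fixed phrase: contains a cue phrase.
        "fixed_phrase": any(cue in lower for cue in CUE_PHRASES),
        # Paragraph: in the first ten or last five paragraphs.
        "paragraph": para_index < 10 or para_index >= n_paragraphs - 5,
        # Thematic word: contains at least one frequent word.
        "thematic": any(w.lower() in thematic_words for w in words),
        # Uppercase: a capitalised word that is not sentence-initial.
        "uppercase": any(w[0].isupper() for w in words[1:]),
    }


feats = sentence_features(
    "In conclusion, drought reduced the yield of King Edward potatoes.",
    para_index=0,
    n_paragraphs=20,
    thematic_words={"drought", "yield"},
)
print(feats)
```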

7 Paice's rules
• Frequency keyword approach: First find a set of index terms for the document (manually, mid-frequency, tf * idf, words occurring in the title, etc.). Then choose the sentences which contain the most keywords.
• Location: The first sentence in a paragraph is most central to the theme of a text. The last sentence is the next most central.
• Cue method: Cue words are not actually keywords, but their presence shows that the sentence is (or is not) important. These may be bonus words, e.g. greatest, significant, or stigma words, e.g. hardly, impossible.
• Indicator phrases, e.g. "The main aim of our paper is to describe …", "Our investigation has shown that …".
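A rough sketch of how the keyword, location, and cue rules might be combined into one sentence score is shown below. The function name `score_sentence` and all the numeric weights are assumptions made for illustration; the slide does not give weights, and indicator phrases are not handled here.

```python
import re

BONUS_WORDS = {"greatest", "significant"}  # examples from the slide
STIGMA_WORDS = {"hardly", "impossible"}    # examples from the slide


def score_sentence(sentence, keywords, position, n_sentences):
    """Score one sentence of a paragraph; higher means more likely to be extracted.

    position is the 0-based index of the sentence within its paragraph.
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    score = sum(1 for w in words if w in keywords)       # frequency keyword approach
    score += sum(1 for w in words if w in BONUS_WORDS)   # cue method: bonus words
    score -= sum(1 for w in words if w in STIGMA_WORDS)  # cue method: stigma words
    if position == 0:                                    # location: first sentence
        score += 2                                       # assumed weight
    elif position == n_sentences - 1:                    # location: last sentence
        score += 1                                       # assumed weight
    return score


paragraph = [
    "Drought has a significant effect on potato yield.",
    "It is hardly surprising that rainfall varies.",
    "Irrigation restored the yield in most plots.",
]
keywords = {"drought", "yield", "potato"}
scores = [score_sentence(s, keywords, i, len(paragraph)) for i, s in enumerate(paragraph)]
print(scores)  # [6, -1, 2]
```

The first sentence wins because it contains three keywords and a bonus word and also opens the paragraph.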

8 Hoey method: cohesion in text
• The most important sentences in a document are those which are related to the largest number of other sentences. Find how many concepts in each sentence are related to concepts in other sentences. Concepts may be related by:
• Exact match, e.g. computer and computer;
• Grammatical variants, e.g. computer, computing;
• Synonyms, e.g. sedate, tranquilise, drug;
• Antonymy, e.g. cold, hot;
• General-specific, e.g. scientists, biologists.

9 Hoey (2)
• Form a repetition net, with entries in the form s (a, b), such as 26 (6, 4), meaning sentence no. 26 is bonded to 6 earlier sentences and 4 later sentences.
• If a + b is high, the sentence is central to the topic;
• If only b is high, the sentence is a topic opener;
• If only a is high, the sentence is topic closing.
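The repetition net can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: only exact word matches count as related concepts (grammatical variants, synonyms, antonyms, and general-specific pairs would need a stemmer or thesaurus), the stopword list is a small made-up one, and the `min_links` threshold for calling two sentences bonded is a hypothetical parameter the slide does not specify.

```python
import re

# Small illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to", "it", "was"}


def concepts(sentence):
    """Approximate the concepts of a sentence by its non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}


def repetition_net(sentences, min_links=1):
    """Return entries (s, a, b): sentence number s is bonded to a earlier
    and b later sentences."""
    sets = [concepts(s) for s in sentences]
    net = []
    for i, ci in enumerate(sets):
        a = sum(1 for j in range(i) if len(ci & sets[j]) >= min_links)
        b = sum(1 for j in range(i + 1, len(sets)) if len(ci & sets[j]) >= min_links)
        net.append((i + 1, a, b))
    return net


sentences = [
    "Drought reduces potato yield.",
    "Potato roots suffer in drought.",
    "Irrigation protects yield.",
    "Fertiliser prices rose.",
]
net = repetition_net(sentences)
print(net)  # [(1, 0, 2), (2, 1, 0), (3, 1, 0), (4, 0, 0)]
```

In this toy text, sentence 1 has only later bonds (b = 2), which matches the slide's description of a topic opener.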

10 Hoey (3)
• Cohesion in text is concerned with explicit references within a sentence which can only be understood by reference to material elsewhere in the text.
• Anaphora come after their explicit mention in the text, e.g. Marie Curie was born in Warsaw. She devoted her life to the study of radioactivity.
• Cataphora come before their explicit mention in the text, e.g. He was to become the best known physicist of his generation. His name was Albert Einstein.

11 Generating Canned Text
• This paper studies the effect of AGENT on the HLP of SPEC
• OR
• This paper studies the effect of INF on the HLP of SPEC
• when it is infested by PEST.
• An experiment was undertaken
• using cultivars CV
• [in, at] LOC
• where the soil was SOIL.
• The HLP [is, are] measured by analysing the LLP.
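The template above could be filled from extracted slot values roughly as follows. This is a sketch under assumptions: the function name `canned_text` is hypothetical, the caller chooses between "in"/"at" and "is"/"are" because the slide does not say how that choice is made, and clauses whose slot is empty are simply dropped. The slot values in the example are invented for illustration.

```python
def canned_text(frame, preposition="in", verb="is"):
    """Fill the canned-text template from a dict mapping slot names to values."""
    # First sentence: use AGENT if present, otherwise INF.
    cause = frame.get("AGENT") or frame["INF"]
    first = f"This paper studies the effect of {cause} on the {frame['HLP']} of {frame['SPEC']}"
    if "PEST" in frame:
        first += f" when it is infested by {frame['PEST']}"

    # Second sentence: optional clauses are added only when the slot is filled.
    second = "An experiment was undertaken"
    if "CV" in frame:
        second += f" using cultivars {frame['CV']}"
    if "LOC" in frame:
        second += f" {preposition} {frame['LOC']}"
    if "SOIL" in frame:
        second += f" where the soil was {frame['SOIL']}"

    sentences = [first + ".", second + "."]
    if "LLP" in frame:
        sentences.append(f"The {frame['HLP']} {verb} measured by analysing the {frame['LLP']}.")
    return " ".join(sentences)


# Hypothetical slot values, not taken from any real article.
example = {
    "INF": "drought", "HLP": "yield", "SPEC": "potato",
    "CV": "King Edward", "LOC": "Scotland", "SOIL": "clay", "LLP": "root mass",
}
text = canned_text(example)
print(text)
```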

12 Extracts vs. Abstracts (Mani, p6)
• An extract is a summary consisting entirely of material copied from the input.
• An abstract is a summary at least some of whose material is not present in the input.

