1 1 Language-based Information and Knowledge Analysis. Professor Khurshid Ahmad, Department of Computing, School of Electronics and Physical Sciences, University of Surrey. e-Science day at the Surrey Research Park, 2 December 2002.

2 2 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

4 4 Computing Intelligently? Knowledge; Intelligence; Cognition. Language; Images; Symbols; Planning; Learning; Thinking; Creativity.

5 5 Computing Intelligently? Knowledge; Intelligence; Cognition. Language; Images; Symbols; Planning; Learning; Thinking; Creativity. Artificially intelligent computing systems attempt to solve problems based on an interpretation of work in psychology, neurobiology, linguistics, mathematics and philosophy.

6 6 The triumvirate of understanding. Knowledge-based: INFORMATION EXTRACTION. Cognition-based: INFORMATION VISUALIZATION. Intelligence based on: SCIENTOMETRICS/BIBLIOMETRICS. Intelligent: INFORMATION RETRIEVAL (IR).

7 7 The triumvirate of understanding. Knowledge-based: INFORMATION EXTRACTION. Cognition-based: INFORMATION VISUALIZATION. Intelligence based on: SCIENTOMETRICS/BIBLIOMETRICS. Intelligent: INFORMATION RETRIEVAL (IR). Major text data bases are online: MEDLINE (11 million papers); Physical Review Online Archive (c. 1890 to date); US Patent Office (all patents from 1900 onwards); Genome data bases.

8 8 The triumvirate of understanding. Knowledge-based: INFORMATION EXTRACTION. Cognition-based: INFORMATION VISUALIZATION. Intelligence based on: SCIENTOMETRICS/BIBLIOMETRICS. Intelligent: INFORMATION RETRIEVAL (IR). Major text and image data bases are online: Reuters News (c. 3000 stories per day); Spectroscopy and analytical data (NIS data bases); Chemical Abstracts, where currently structure diagrams are ignored; Crime-related images with annotated information.

9 9 The triumvirate of understanding. Knowledge-based: INFORMATION EXTRACTION. Cognition-based: INFORMATION VISUALIZATION. Intelligence based on: SCIENTOMETRICS/BIBLIOMETRICS. Intelligent: INFORMATION RETRIEVAL (IR). Major text and image data bases are online. Recently, studies of how science and technology evolve have been related to issues of business management, particularly the emergence of competition, disruptive technologies, and opportunities for collaboration across disciplines.

10 10 The triumvirate of understanding. Knowledge-based: INFORMATION EXTRACTION. Cognition-based: INFORMATION VISUALIZATION. Intelligence based on: SCIENTOMETRICS/BIBLIOMETRICS. Intelligent: INFORMATION RETRIEVAL (IR). Major text and image data bases are online. Such methods are used essentially with structured spatial and temporal data. Abstract non-spatial and atemporal data, for example free text as found in journal papers, in various abstracts data bases (cf. MEDLINE), in electronic mail comprising user-to-expert communication, or in web-access patterns, are typically visualised using the so-called thematic landscapes. This would need the GRID.

12 12 The triumvirate of understanding: Need for/of the Grid • Coordinating data sets based on common sets of metadata: need for standards beyond those for architecture of the Grid (OGSA) • Grid-enabling text analysis systems would enable processing of large volumes of distributed data • Grids provide the infrastructure for development of generic computing applications capable of dealing with and combining results of analysis of various types of data: language, images, graphs.

13 13 Computing Intelligently? A knowledge-based system can be programmed to reason over a set of facts, propositions, rules and rules of thumb and, sometimes, the system may come to the same conclusion as a human being.

14 14 Computing Intelligently – with rules of thumb about images? Recognising and reasoning about the visual environment is something that people do extraordinarily well. In these abilities an average three-year-old makes the most sophisticated computer vision system look embarrassingly inept.

15 15 Computing Intelligently – with rules of thumb about images?

16 16 Computing Intelligently – with rules of thumb about words? Natural Language: a person’s native tongue; organic, ambiguous, creative, wilful. Natural Language Processing: processing of natural language (e.g., English) by a computer to facilitate communication with the computer or for other purposes, such as word processors, computer-based dictionaries and thesauri, summarizers, machine translators, text filters, grammar checkers, …

17 17 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

18 18 Complexity of science [Figure: lexical difficulty plotted against year of publication (between 1930 and 1990)]

19 19 Complexity of science Lexical processes used by scientists involve: repetition of lexical items comprising the specific vocabulary of a subject domain; inventing new words; borrowing words from other domains; re-defining words or terms. Such processes contribute significantly to the organisation and communication of tacit and explicit knowledge.

20 20 Complexity of science We have developed a computer-based method that compares the relative occurrence of single words in an English scientific paper (or a collection or corpus of papers) with the occurrence of the same words in a representative sample of contemporary English. The British National Corpus is a 100-million-word digital collection of written (and spoken) English produced during 1975-1993. Three-quarters of the text is drawn from (A-level+) natural, social and applied sciences, from arts and culture, and from commerce and finance. The other quarter includes works of fiction and popular science. BNC-type corpora are used extensively in producing dictionaries for general use.
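
The comparison described on this slide can be sketched as a ratio of relative frequencies. This is a minimal illustration under stated assumptions, not the Surrey method itself: the function name `relative_frequency_ratio`, the toy token lists and the add-one smoothing for words missing from the general corpus are all introduced here for illustration.

```python
from collections import Counter


def relative_frequency_ratio(special_tokens, general_tokens):
    """Ratio of each word's relative frequency in a specialist corpus
    to its relative frequency in a general-language corpus.

    Add-one smoothing on the general-corpus side is an assumption made
    here so that words absent from the general corpus do not cause a
    division by zero.
    """
    special = Counter(t.lower() for t in special_tokens)
    general = Counter(t.lower() for t in general_tokens)
    n_special = sum(special.values())
    n_general = sum(general.values())
    ratios = {}
    for word, count in special.items():
        rel_special = count / n_special
        rel_general = (general[word] + 1) / (n_general + 1)
        ratios[word] = rel_special / rel_general
    return ratios


# Toy token lists, invented purely for illustration.
specialist = "the tunnelling current in the tunnel diode".split()
general_language = "the train went through the tunnel and the hill".split()

ratios = relative_frequency_ratio(specialist, general_language)
# Words ranked from most to least over-represented in the specialist text.
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

A ratio well above 1 marks a word that is used far more often in the specialist text than in general language; function words such as "the" tend to fall at or below 1.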

21 21 Complexity of science • Leo Esaki discovered a new semi-conductor device, the tunnel diode, in 1957. • The super-fast, current-switching device earned Esaki a Nobel Prize, and yet technological obstacles hindered widespread use in conventional, silicon-based circuits. • Recent developments in tunnel diodes could help chip-makers boost silicon's speed while further shrinking chips. We have developed a text corpus, comprising 100-odd journal papers published between 1980 and 2000 and containing over 430,000 words, on the topic of tunnel diodes or, more precisely, on resonant tunnel diodes.

22 22 Complexity of science A lexico-morphological signature of discovery? Weird/excessive use of tunnel: frequency relative to BNC. Columns: Surrey Corpus (a); British National Corpus (b). tunnel501 tunnels32 tunnelled701 tunnelling6851. Magnetotunneling does not exist in the British National Corpus.

23 23 Complexity of science A lexico-morphological signature of the discovery of tunnel diodes? Lexical ‘productivity’ of tunnel & resonant: Frequently used compound words

24 24 Complexity of science Lexicomorphological signature: Compound Words. Same thing? unipolar resonant tunneling diode; bipolar light-emitting resonant tunneling diode; resonant interband tunneling diode - RITD; interband resonant tunneling diode; delta doped resonant tunneling diode; quantum well resonant tunneling diode; resonant tunneling diode; double-barrier resonant tunneling diode; interband double barrier tunneling diode; tunneling diode.

25 25 Complexity of science Information from journals is passed into patents.

26 26 Complexity of science Visualising fashions in science and technology: The movement of iconic terms.

27 27 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

28 28 The triumvirate of understanding: Knowledge; Intelligence; Cognition

29 29 The triumvirate of understanding (with apologies to Plato). KNOWLEDGE: Knowledge about, knowledge by description: knowledge of a person, thing, or perception gained through information or facts about it rather than by direct experience. INTELLIGENCE: An impersonation of intelligence; an intelligent or rational being; esp. applied to one that is or may be incorporeal; a spirit. COGNITION: The action or faculty of knowing taken in its widest sense, including sensation, perception, conception, etc., as distinguished from feeling and volition. Language; Images; Symbols; Planning; Learning; Thinking; Creativity.

30 30 The triumvirate of understanding (with apologies to Aristotle). KNOWLEDGE: Knowledge of a person, thing, or other entity (e.g. sense-datum, universal) by direct experience of it, as opposed to knowing facts about it. So knowledge of, by, acquaintance. INTELLIGENCE: Knowledge as to events, communicated by or obtained from another; information, news, tidings. COGNITION: A product of such an action: a sensation, perception, notion, or higher intuition. Language; Images; Symbols; Planning; Learning; Thinking; Creativity.

31 31 The triumvirate of understanding. Knowledge-based: INFORMATION EXTRACTION. Cognition-based: INFORMATION VISUALIZATION. Intelligence based on: SCIENTOMETRICS/BIBLIOMETRICS. Intelligent: INFORMATION RETRIEVAL (IR).

32 32 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

33 33 Dealing with information deluge There are over 2,000 news wires produced by Reuters Financial, together with on-line reports from banks, brokerage houses and regulatory bodies. Filtering the relevant from the not-so-relevant is a major problem. All major journals in science and technology, together with pre-prints, textbooks, conference proceedings, technical reports, research road-maps and (US) patent documents, are available (almost) freely. Extracting relevant documents from this intellectual deluge is challenging the limits of documentation and has a serious impact on innovation and technology transfer.

34 34 Dealing with information deluge The news report is one of the most commonly occurring linguistic expressions. Despite being a good example of open-world data, a news report is a contrived artefact: each report has a potentially attention grabbing headline; the opening few sentences generally comprise a good summary of the contents of the report; there are slots for the date of origin and slots for photographs and other graphic material.

35 35 Dealing with information deluge The relationship between Events, News and Markets (price) through Information.

36 36 Dealing with information deluge Movement from Feb 2001 to Jan 2002. Note the dip on and around Sep 11th 2001, although all markets were falling before this.

37 37 Dealing with information deluge Francis Knowles has written about the health metaphors used in financial news reports: markets are full of vigour and are strong, or the markets are anaemic or are weak (1996). Most newspapers also use animal metaphors: there are bull markets and bear markets. The former refer to expansion, and indirectly to fertility, and the latter to shy, retiring and grizzly behaviour, much like that reported about bears in the popular press and in literature for children.

38 38 Dealing with information deluge
Mainly Good News Stories: ‘Naval shipbuilder and military contractor Vosper Thornycroft has boosted its civil arm by buying facilities manager Merlin Communications’ (Nov 14, 2001). ‘The FTSE 100 stock index looks set to open stronger today after Wall Street added to gains seen at the London close and with U.S. stock index futures boosted by rumours that Osama bin Laden had been captured.’ (Nov 15, 2001).
Rather Bad News Stories: ‘Heavyweight banking and oil stocks have dropped up the leading share index as investors bet on fresh interest rate cuts.’ (Nov 21, 2001). ‘The European Commission has slashed its official growth forecasts for the euro zone [..], predicting the most serious slowdown since the 1990s recession, with lower growth in 2002 than this year.’ (Nov 21, 2001).

39 39 Dealing with information deluge We created a corpus of 1,539 English financial texts from one source (Reuters) on the World Wide Web, published during a 3 month period (Oct 2001-January 2002) and comprising over 310,000 tokens. The corpus comprised a blend of short news stories and financial reports. Most of the news is business news from Britain, with thirty percent of the news from Europe and the United States.
Frequency of Good and Bad words in Nov 2001:
Week (5 day week)   Good Word Frequency   Bad Word Frequency
1                   58                    40
2                   71                    75
3                   77                    66
4                   73                    59
5                   72                    28
Total               351                   268
The minimum weekly frequencies are 58 (good) and 28 (bad); the maximum weekly frequencies are 77 (good) and 75 (bad).
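
A weekly good/bad word count of the kind tabulated on this slide could be produced along the following lines. This is a sketch only: the slides do not list the actual 'good' and 'bad' words, so `GOOD_WORDS`, `BAD_WORDS`, the helper names and the two sample weeks of headlines are hypothetical.

```python
import re
from collections import Counter

# Hypothetical seed lists; the actual word lists are not given in the slides.
GOOD_WORDS = {"boosted", "gains", "stronger", "rise"}
BAD_WORDS = {"slashed", "slowdown", "recession", "fall"}


def count_sentiment_words(text):
    """Return (good, bad) counts of seed words in one news story."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    good = sum(counts[w] for w in GOOD_WORDS)
    bad = sum(counts[w] for w in BAD_WORDS)
    return good, bad


def weekly_totals(stories_by_week):
    """Map each week number to its (good, bad) totals over all stories."""
    totals = {}
    for week, stories in stories_by_week.items():
        good = bad = 0
        for story in stories:
            g, b = count_sentiment_words(story)
            good += g
            bad += b
        totals[week] = (good, bad)
    return totals


# Invented example stories, keyed by week number.
stories = {
    1: ["Shares boosted by gains on Wall Street",
        "Forecasts slashed amid slowdown"],
    2: ["Index looks set to open stronger"],
}
totals = weekly_totals(stories)
```

Counting exact word forms keeps the sketch short; a fuller treatment would probably need to handle inflected forms and negation.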

40 40 Dealing with information deluge Market correlation between ‘good’ word frequency and FTSE index.

41 41 Dealing with information deluge Good and bad word frequency correlated with FTSE 100.
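
One common way to quantify such a relationship is Pearson's correlation coefficient; the slides do not say which measure was used, so this choice is an assumption. The FTSE 100 values behind the plots are not in this transcript, so the usage lines below correlate the two weekly word-frequency series for Nov 2001 reported earlier in the talk, purely to show the call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Weekly good and bad word frequencies for Nov 2001, as reported in the talk.
good = [58, 71, 77, 73, 72]
bad = [40, 75, 66, 59, 28]
r_good_bad = pearson(good, bad)
```

To relate word frequency to the market, one of the two series would be replaced by index values sampled over the same weeks.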

42 42 Dealing with information deluge [Diagram labels: SYSTEM QUIRK; Reuters News Feed; Up; Down; Time Series of Up and Down; FTSE 100 INDEX; Generate Signal (Buy / Sell)] Partners: Ibermatica, Madrid; Finsoft, London; JRC GmbH, Berlin. This work is being carried out under the auspices of the EU-IST sponsored GIDA project. The project aims to create a novel service type in the financial investment business. Its novelty lies in the integration of financial analysis with news analysis.

43 43 Dealing with information deluge FTSE 100 plotted against ‘bad news’ • 20 February 2002 was one of the lowest days. The SATISFI system keeps track of news reports with bad (and good) news.

44 44 Dealing with information deluge SATISFI (Sentiment and Time Series: Financial analysis System) is being developed at the University of Surrey for the EU-IST GIDA Project. [Figure: Good News plotted with FTSE 100] SATISFI is based on our existing text analysis system, System Quirk, together with programs for time series analysis, text summarisation and organising large text collections, and programs for creating thesauri and term bases. Systems for learning the behaviour of the markets are also being developed.

45 45 Profiting from information deluge? See also: http://www.vicefund.com/

46 46 Dealing with information deluge We have used a neural computing system that creates its own categories given a class of computational objects, say a digitised, computer-understandable version of a set of news stories: a set of keywords representing the whole set. Each keyword will be present in some stories and absent from others. The system is first trained on a set of keywords and creates categories; it then categorises unseen stories into the categories it has already created.
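
The keyword representation described on this slide can be pictured as a presence/absence vector per story. The sketch below is an assumption about the encoding, not the system's actual input format; the function name `keyword_vector`, the four keywords and the sample story are chosen for illustration.

```python
import re


def keyword_vector(story, keywords):
    """Encode a story as a 0/1 vector: 1 where the keyword occurs in it."""
    tokens = set(re.findall(r"[a-z]+", story.lower()))
    return [1 if keyword in tokens else 0 for keyword in keywords]


# Illustrative keyword set and an invented story.
keywords = ["tax", "pollution", "fuel", "cocaine"]
vector = keyword_vector("Congress debates a new tax on fuel", keywords)
```

Each story then becomes a fixed-length numeric input that a neural network can be trained on, whatever the length of the original text.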

47 47 Dealing with information deluge Automatic Categorization of Texts Based on Keywords Using a neural computing system. Our text corpus consisted of 100 Associated Press (AP) news wires selected from 10 pre-classified news categories, shown together with their icons. The average length of the articles was 622 words.

48 48 Dealing with information deluge Text Categories: 1 Bioconversion; 2 Pollution Recovery; 3 Alternative Fuels; 4 Fossil Fuels; 5 Rain Forests; 6 Exportation of Industry; 7 Foreign Trade; 8 Int. Drug Enforcement; 9 Foreign Car Makers; 10 Worldwide Tax Sources. These text categories are used in the TIPSTER – SUMMARY program, but were not known to our system.

49 49 Dealing with information deluge Salient single words identified automatically by System Quirk: 1 percent; 2 tax; 3 billion; 4 drug; 5 reagan; 6 cars; 7 taxes; 8 environmental; 9 pollution; 10 fuel; 11 federal; 12 dukakis; 13 bush; 14 congress; 15 mexico; 16 emissions; 17 drugs; 18 fuels; 19 senate; 20 auto; 21 proposal; 22 gasoline; 23 exports; 24 vehicles; 25 ohio; 26 greenhouse; 27 dioxide; 28 marine; 29 mazda; 30 gases; 31 shale; 32 deficit; 33 export; 34 recycling; 35 epa; 36 honda; 37 methanol; 38 automakers; 39 panama; 40 corp; 41 forests; 42 cocaine; 43 enforcement; 44 warming; 45 smog; 46 ozone; 47 massachusetts; 48 imports; 49 automobile; 50 trafficking.

50 50 Dealing with information deluge Results of a Full Text Map trained using exponentially decreased neighbourhood and learning rate.
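
A self-organising map trained with an exponentially decreasing neighbourhood and learning rate can be sketched as follows. This is a generic textbook-style map written for illustration, not the network used in the talk: the decay schedule `exp(-epoch / epochs)`, the Gaussian neighbourhood, the 2 x 2 grid, the default parameters and the four toy keyword vectors are all assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a small self-organising map on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius both decay exponentially.
        decay = math.exp(-epoch / epochs)
        lr = lr0 * decay
        radius = radius0 * decay
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Gaussian neighbourhood: nodes near the winner move more.
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights


# Toy keyword-presence vectors for four stories (invented for illustration).
stories = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
som = train_som(stories, rows=2, cols=2)
winners = [best_matching_unit(som, s) for s in stories]
```

After training, each story is assigned to the grid node that wins for it, and stories that share keywords would be expected to land on the same or neighbouring nodes.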

52 Dealing with information deluge TEXT SUMMARISATION: Surrey’s Program Telepattern. Evaluation of summary accuracy of 30 texts by 4 defence intelligence assessors. THE PROGRAMS WERE EVALUATED by the US DoD’s TREC AND TIPSTER Programmes.

53 53 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

54 54 The Missing Link: Images and Text • The administration of justice requires systematic prosecution of the perpetrators of crime. • One key element in this system is the collection, analysis and dissemination of information collected safely and securely from the scene where the crime was committed. • The information comprises images of the scene and the descriptions and interpretations of these images. • In a murder case there may be over 2000 scene-of-crime images and the case can take up to two years to come to court. It is important for these images to be indexed appropriately and retrieved efficiently. Scene-of-crime officers (SoCOs) play a key role in the collection of this vital multi-modal information; they describe the image and the context in which the images were collected. The police officers involved in the administration of justice provide the interpretation.

55 55 The Missing Link: Images and Text • Collateral texts are written texts or speech (fragments) closely or loosely related to an image or objects within the image. CLOSELY COLLATERAL TEXTS: CAPTION; CRIME SCENE REPORT. BROADLY COLLATERAL TEXTS: NEWSPAPER ARTICLE; DICTIONARY DEFINITION. • The collateral texts are special language texts and comprise keywords that may help in indexing and retrieving the images.

56 56 The Missing Link: Images and Text The EPSRC-sponsored SoCIS project, involving the Universities of Surrey and Sheffield, is developing methods and techniques for automatically indexing images with the descriptions provided by Scene of Crime Officers. Typical Scene of Crime Images: 9 mm browning high power pistol; footwear impression in blood; body on floor showing adjacent table; fingerprints showing ridges. The SoCIS project is investigating how the results of the project can be generalised such that the methods and techniques can be applied to an arbitrary domain.

57 57 The Missing Link: Images and Text What do SOCOs do now? Forms, forms and more forms

58 58 The Missing Link: Images and Text The SoCIS project is developing methods and techniques for automatically indexing images taken at a crime scene with the descriptions provided by scene of crime officers. Five UK Police Forces are working closely with our project: they provide knowledge of their subject domain, test our system and advise us generally. The forces are: Hampshire Constabulary; Metropolitan Police; Surrey Police; South Yorkshire Police; Kent Constabulary.

59 59 The Missing Link: Images and Text The SoCIS project is developing methods and techniques for automatically indexing images with the descriptions provided by scene of crime officers. [Screenshot labels: Edit Button; Shape Buttons; Save Button; Select Button; Delete Button; Show All Hotspots Button]

60 60 DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK? SOCIS: A prototype image and text storage and retrieval system. • Automatic Labelling (or INDEXING) of images by keywords in the descriptions provided by the SOCOs. • Automatic Extraction of terms and their relationship to other terms (ontology) from the descriptions and other texts. Hierarchy tree (levels from the top): EVIDENCE; TRACE EVIDENCE; FIBRE, BLOOD, DNA; INORGANIC FIBRE, MANUFACTURED POLYMERIC FIBRE, DYE FIBRE. The above hierarchy tree is based on our 0.7 million word forensic science text corpus.
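
Keyword labelling of images from their descriptions can be sketched as a small inverted index. This is an illustrative assumption about the approach, not the SOCIS implementation: the image identifiers, the stop-word list and the function names are hypothetical, while the four descriptions are the captions of the typical scene-of-crime images shown earlier in the talk.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; function words are not useful as index terms.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "showing"}


def keywords_of(text):
    """Lower-cased content words and numbers found in a description."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def index_images(descriptions):
    """Build an inverted index: keyword -> set of image identifiers."""
    index = defaultdict(set)
    for image_id, text in descriptions.items():
        for keyword in keywords_of(text):
            index[keyword].add(image_id)
    return index


def search(index, query):
    """Images whose descriptions contain every keyword in the query."""
    terms = keywords_of(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Image identifiers are invented; the captions come from the talk.
descriptions = {
    "img1": "9 mm browning high power pistol",
    "img2": "Footwear impression in blood",
    "img3": "Body on floor showing adjacent table",
    "img4": "Fingerprints showing ridges",
}
index = index_images(descriptions)
```

A call such as `search(index, "browning pistol")` then returns the identifiers of images whose description mentions both terms.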

61 61 DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK? SANNC: A neural computing system that learns how to relate textual descriptions with images. Automatic Clustering of similar images in an image collection. Automatic Identification of the position of objects in an image or images. [Diagram labels: IMAGE; TEXT (Nine millimetre browning high power self-loaded pistol); SELF ORGANISING MAP (twice); HEBBIAN NETWORK]

62 62 DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK? [SOCO 1 – spontaneous free text:] Close up view of exhibit ABC/3 red and silver knife handle on alleyway floor adjacent to building and metal gate. IDENTIFICATION: [1] Close up view of exhibit ABC/3 [.] [2] Red and silver knife handle. LOCATION: On alleyway floor. ELABORATION: Adjacent to building and metal gate. Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image?

63 63 DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK? SOCO5: Close up item 3. SOCO7: Close up of item 3 - SOCO1: Close up of knife. SOCO8: Close up view item 3 - SOCO2: Close up view of ex 3. SOCO4: Close up view of exhibit 3. SOCO3: Close up view of exhibit ABC/3. SOCO6: Close view of marker 3. Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image? Not really: there are three ‘structures’ – identification, location and elaboration. The linguistic description shows little or no variation. Research continues.

64 64 Variation amongst SOCOs? SOCO2: a red handled lock knife. SOCO6: against red handled knife. SOCO5: Knife handle. SOCO3: red and silver knife handle. SOCO4: red handled flick knife. SOCO8: red handled flick knife. SOCO7: red penknife. SOCO1: Red sides. Metal ends. Indexer Variability: Given the image descriptions are in free text, perhaps each SOCO gives a different description of the image? Not really: there are three ‘structures’ – identification, location and elaboration. The linguistic description shows little or no variation. Research continues.
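
One simple way to put a number on indexer variability is the word overlap between pairs of descriptions. The slides do not say how variation was assessed, so the Jaccard measure below is a stand-in chosen for illustration; the descriptions are five of those listed on this slide.

```python
import re
from itertools import combinations


def word_set(text):
    """Set of lower-cased words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words in the two descriptions."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


# Five of the descriptions given by the officers on this slide.
descriptions = {
    "SOCO2": "a red handled lock knife",
    "SOCO3": "red and silver knife handle",
    "SOCO4": "red handled flick knife",
    "SOCO7": "red penknife.",
    "SOCO8": "red handled flick knife.",
}

# Overlap score for every pair of officers.
overlap = {
    (x, y): jaccard(descriptions[x], descriptions[y])
    for x, y in combinations(sorted(descriptions), 2)
}
```

A score of 1.0 means two officers used exactly the same words; scores near 0 mean the descriptions share almost no vocabulary.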

65 65 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

66 66 Need for/of the Grid • Data Grids: management of large volumes of text, images, financial data, … • Computational Grids: processing of large volumes of such data • Collaborative Grids: activities in research (virtual crime investigation)

67 67 Need for/of the Grid • Coordinating data sets based on common sets of metadata: need for standards beyond those for architecture of the Grid (OGSA) • Grid-enabling System Quirk would enable processing of large volumes of distributed data • Grids provide the infrastructure for development of generic computing applications capable of dealing with and combining results of analysis of various types of data

68 68 Talk Outline • Computing Intelligently • The complexity of science • The triumvirate of understanding • Dealing with information deluge • The Missing Link: Images and Text • Need for/of the Grid • Afterword

69 69 Afterword: The Department of Computing • Software Engineering • Theoretical Computing • Knowledge Management • Neural Computing • Information Extraction and Multi-media Group. A research-active Department. Applied to EPSRC to be involved with the e-Science Programme. Looking to develop industrial collaborations for ALL research activities.

70 70 Afterword: The Department of Computing A Department that has, or is looking forward to, active collaboration within the University: Computer Vision (CVSSP – the new JIF Lab); Satellite Engineering (SSTL – Best Practice); Linguistics & Dance. A Department that is looking forward to active collaboration outside the University with: Unis Sheffield, Southampton, Metropolitan Police College, Queen Mary London. A Department that is looking forward to exploiting its software systems, especially financial prediction systems and language engineering systems.

