We think you have liked this presentation. If you wish to download it, please recommend it to your friends in any social system. Share buttons are a little bit lower. Thank you!
Presentation is loading. Please wait.
Published byOmar Frost
Modified about 1 year ago
Bogdan Vrusias © 2003 Scene of Crime Information System: Playing at St. Andrews 22nd August 2003 Bogdan Vrusias, Mariam Tariq, Lee Gillam Department of Computing University of Surrey England
Bogdan Vrusias © 2003 Talk Outline SoCIS Handling Multilinguality Synonymy and Morphology Relevance Ranking Performance Issues Results and Evaluation Conclusions & Future
Bogdan Vrusias © 2003 SoCIS The EPSRC-funded Scene of Crime Information System (SoCIS) project was running from October 1999 to March 2003. The aim was to study the link between images and texts within a specialist domain context. A method has been outlined for developing an intelligent content-based image retrieval (CBIR) system, which can store and retrieve images based on the linguistic descriptions of the images. The corpus-based method uses the lexical and semantic properties of specialist texts for extracting key terms and for discovering the ontological organisation of the terms.
Bogdan Vrusias © 2003 SoCIS The system, which is based on a 3-tier architecture of client, server, and database, can be accessed via a local intranet. SoCIS is an intelligent CBIR system that automatically: –labels (and indexes) images by keywords as well as relational facts extracted from the descriptions provided by domain experts; –extracts physical features of an image; –populates a database comprising domain-specific terminology, together with the semantic relationships between terms, starting from a random selection of collateral texts of the domain; and –learns to link image and text by using neural networks
Bogdan Vrusias © 2003 SoCIS SoCIS has integrated modules from –System Quirk (Ahmad & Rogers, 2001) - a set of tools for building and managing multilingual term bases with the use of powerful text analysis techniques, and –GATE (Cunningham et al., 2002) - a framework and graphical development environment comprising robust NLP tools. The main advantages that SoCIS can be said to have over other text-based and CBIR systems is its ability to extract information from both texts and images, to encode this information for indexing, and to build thesauri, all automatically.
Bogdan Vrusias © 2003 Adapting SoCIS SoCIS was specifically targeted at the use of specialist languages. The system has been built based on the knowledge gathered from Scene of Crime experts –from the testing and evaluation sessions performed with them, and –from a domain-specific text corpus. The system had to be adapted to deal with multilinguality as well as structured data from a more general domain for the ImageCLEF collection. The indexing module was used to extract single and compound terms from the output of the parser. We used Wordnet for query expansion purposes but the indexing had to be carried out without using a terminology dictionary to filter out invalid terms. A relevance ranking mechanism was adopted to handle the expanded terms retrieved from Wordnet.
Bogdan Vrusias © 2003 Handling Multilinguality We relied upon translation engines as found on the Internet. Google’s translation tools encountered difficulties. Altavista’s Babelfish was selected as the principal translation engine (http://babelfish.altavista.com/) However since Altavista’s Babelfish does not translate Dutch, FreeTranslation.com (http://www.freetranslation.com/) To translate the queries, Java code was used to wrap definitions of the query syntax used by these sites. Using the Java JTidy utility, the resulting HTML was converted to XML and XSLT.
Bogdan Vrusias © 2003 Handling Multilinguality Certain of these translations will cause problems with the retrieval. E.g. "Golf course bunkers": The quality of returned translation will therefore have a significant impact on the results being returned. German Golf course shelter French Bunkers of ground of gulf Italian (1) A bunker in a distance of golf Italian (2) bunkers in a golf course Spanish (1) B??nkers in a golf course Spanish (2) Track of golf Dutch Bunkers on a wave job
Bogdan Vrusias © 2003 Synonymy and Morphology A program was written to query a Wordnet database to provide a set of synonyms and hyponyms for each of the query terms. Given a query term, the program returns all the words in the synset that the particular term is an element of, as well as all the hyponyms of each synset element to a specified level in the hierarchy. Initially we planned to go down 2 levels in the hierarchy but ended up using just the synonyms due to system performance issues related to the large number of expanded terms returned. Some basic morphological analysis was also carried out for each query term to account for the use of variants such as singular or plural terms as well as the verb or adjective forms.
Bogdan Vrusias © 2003 Synonymy and Morphology Taking the query “Boats on Loch Lomond” as an example: The term ‘boat’ returned 53 expanded words going down one level in the hierarchy. –Synonyms returned were: travel on water, sauceboat, gravy boat. –Hyponyms returned were: motorboat, mail boat, mailboat gondola, propel by oars, propel by paddles, yacht, and so on. ‘Loch’ returned one synonym lough. ‘Lomond’ was not present since it is a proper noun. The very common term ‘man’ had 131 expanded words going down one level and 344 expanded words going down two levels with words such as –private, make swollen, belly out, candy striper, Homo erectus, clothes horse, ridicule with a satire, and gentleman.
Bogdan Vrusias © 2003 Relevance Ranking Each keyword carried a proportion of its frequency in an annotation divided by the total number of terms allocated to this annotation. The original keyword was then multiplied with weight 1, each expanded term (synonyms) returned by WordNet with weight 0.9, and words containing substrings of the original keywords with weight 0.1. The total ranking was then given by: Where f td is the term frequency of term t in document d, w t is the weight of a term t as described previously, and N d is the total number of words in document d.
Bogdan Vrusias © 2003 Performance Issues The system has been designed for the analysis of free text in specialist domains whereas with the ImageCLEF collection we were dealing with structured texts in a general domain. The indices produced were relatively unreliable due to the different syntactic structure of the ImageCLEF text when compared to free text. Due to the fact that we used Wordnet for query expansion, we encountered problems associated with polysemous words as well as different word forms. Due to the amount of time it was taking to process the expanded queries (some times reaching up to 300 words) we had to limit the expansion to just synonyms of the original query terms. We had six computers running in parallel to finish the processing, which was taking approximately 8 hours per language.
Bogdan Vrusias © 2003 Results and Evaluation A system that in principle would allow a user to query a collection of images that have been annotated in English, using a query in one of six languages has been prototyped. Across all languages, the following sets of results were obtained (missing topics and quantities for that topic are given in the third column) Spanish105 / 11732 (3), 33 (1), 34 (1), 36 (1), 39 (2), 43 (3), 47 (1) English48 / 5040, 46 French47 / 517, 17, 25 Italian91 / 10313 (2), 17 (1), 27 (3), 29 (2) 31 (1), 39(1), 43 (1), 45 (1), German43 / 504, 7, 13, 27, 40, 46, 48 Dutch38 / 505, 7, 13, 17, 18, 20, 27, 29, 36, 39, 40, 43 Total372 / 421
Bogdan Vrusias © 2003 Results and Evaluation For Topic 14, the top 5 results have been taken once for each language, and the similarity matrix between these results is as follows. These results show degrees of similarity between the English, Italian and Dutch results, with German and Spanish showing similarities, and French showing the most marked behavioural difference. EnFrDeEsItNlTotal 1673721 12 4 121341 113 142113 323 223014 433 164305 553 29031 33 2 16014 44 2 12138 55 2 16009 2 1 9586 3 1 9587 4 1 4618 5 1 13150 1 1 16833 2 1 5702 2 1 20573 41
Bogdan Vrusias © 2003 Results and Evaluation Taking a list of the exemplar images for retrieval, the ranking (where it exists) of that image within the 1000 results for each language was considered. For each language, if the exemplar image was retrieved within the first 1000, this was counted. If it was retrieved within the top 100 results, this was also noted. The following table presents the results obtained. HighLowAveIn top 100 Nl205017971319.557 De215011973309.0511 En28508798257.9612 Fr335111995337.5211 It421031967353.1414 Es511177884314.5219
Bogdan Vrusias © 2003 Note on Text & Image Retrieval Increasingly, images are being indexed and retrieved by both their visual content and by related texts such as captions that describe the image. Image descriptors extracted directly from image data (colour, texture and shape) tend to capture little of an image’s semantic content – hence there is a need to extract information about the image content from collateral texts. Information fusion has proven that in some cases it increases the retrieval of relevant data.
Bogdan Vrusias © 2003 Future Work Developing grid-enabled systems where the different processing modules, as well as instances of the same module, could be run as a service, in parallel, which would significantly improve the processing time. The ranking mechanism needs to be further refined and tuned by carrying out more trial runs. To improve the query expansion, one suggestion could be to use part-of- speech information from the query sentence to filter out some of the irrelevant expanded terms. We are investigating methods of effectively combining text-based with image-based retrieval techniques. This technique incorporated into a system that learns how to index, would improve performance. We are investigating the creation of multimedia thesauri.
Bogdan Vrusias © 2003 Questions?
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
“How much context do you need?” An experiment about context size in Interactive Cross-language Question Answering B. Navarro, L. Moreno-Monteagudo, E.
Automatic indexing and retrieval of crime-scene photographs Katerina Pastra, Horacio Saggion, Yorick Wilks NLP group, University of Sheffield Scene of.
A Study on Query Expansion Methods for Patent Retrieval Walid MagdyGareth Jones Centre for Next Generation Localisation School of Computing Dublin City.
Queensland University of Technology An Ontology-based Mining Approach for User Search Intent Discovery Yan Shen, Yuefeng Li, Yue Xu, Renato Iannella, Abdulmohsen.
Terminology and documentation* Object of the study of terminology: analysis and description of the units representing specialized knowledge in specialized.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
MIRACLE Multilingual Information RetrievAl for the CLEF campaign DAEDALUS – Data, Decisions and Language, S.A. Universidad Carlos III de.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
NLP And The Semantic Web Dainis Kiusals COMS E6125 Spring 2010.
Knowledge Science & Engineering Institute, Beijing Normal University, Analyzing Transcripts of Online Asynchronous.
WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
The CLEF 2003 cross language image retrieval task Paul Clough and Mark Sanderson University of Sheffield
A Language Independent Method for Question Classification COLING 2004.
1 Query Operations Relevance Feedback & Query Expansion.
1 Overview of Search Engines Rong Jin. 2 Search Engine Architecture Consists of two major processes Indexing process Query process.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
Ciro Cattuto, Dominik Benz, Andreas Hotho, Gerd Stumme Presented by Smitashree Choudhury.
The Jikitou Biomedical Question Answering System: Using a Syntactic Parser to Rank Possible Answers Michael A. Bauer 1,2, Daniel Berleant 1, Robert E.
Multilinguality & Semantic Search Eelco Mossel (University of Hamburg) Review Meeting, January 2008, Zürich.
CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.
BIT 3193 MULTIMEDIA DATABASE CHAPTER 4 : QUERING MULTIMEDIA DATABASES.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
The PATENTSCOPE search system: CLIR February 2013 Sandrine Ammann Marketing & Communications Officer.
A Genetic Algorithm-Based Approach to Content-Based Image Retrieval Bo-Yen Wang( 王博彥 )
Multilingual Information Exchange APAN, Bangkok 27 January 2005
Natural Language Processing for Information Retrieval -KVMV Kiran ( ) -Neeraj Bisht ( ) -L.Srikanth ( )
TOPIC CENTRIC QUERY ROUTING Research Methods (CS689) 11/21/00 By Anupam Khanal.
GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)
Query expansion COMP423. Menu Query expansion Two approaches Relevance feedback Thesaurus-based Most Slides copied from
Personalized Recommendation of Related Content Based on Automatic Metadata Extraction Andreas Nauerz 1, Fedor Bakalov 2, Birgitta.
Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.
Presentation Outline Project Aims Introduction of Digital Video Library Introduction of Our Work Considerations and Approach Design and Implementation.
Extraction and Visualisation of Emotion from News Articles Eva Hanser, Paul Mc Kevitt School of Computing & Intelligent Systems Faculty of Computing &
Analysis of Experiments on Hybridization of different approaches in mono and cross-language information retrieval DAEDALUS – Data, Decisions and Language,
Automatic Web Page Categorization by Link and Context Analysis Giuseppe Attardi Antonio Gulli Fabrizio Sebastiani.
2/10/2016Semantic Similarity1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis.
The Informative Role of WordNet in Open-Domain Question Answering Marius Paşca and Sanda M. Harabagiu (NAACL 2001) Presented by Shauna Eggers CS 620 February.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
2005/12/021 Content-Based Image Retrieval Using Grey Relational Analysis Dept. of Computer Engineering Tatung University Presenter: Tienwei Tsai ( 蔡殿偉.
Quality Control for Wordnet Development in BalkaNet Pavel Smrž Faculty of Informatics, Masaryk University in Brno, Czech.
© 2017 SlidePlayer.com Inc. All rights reserved.