Generating topic chains and topic views: Experiments using GermaNet Irene Cramer, Marc Finthammer, and Angelika Storrer


Generating topic chains and topic views: Experiments using GermaNet Irene Cramer, Marc Finthammer, and Angelika Storrer Faculty of Cultural Studies, Dortmund University of Technology, Germany

Outline
- Project context
- Concept of topic chains and topic views
- Construction of topic views
- Evaluation

Project context
- Project HyTex: text-grammatical foundations for (semi-)automated text-to-hypertext conversion
- Research line in this context: topic-based linking strategies using lexical chaining as a resource

Concept of topic chains
- Partial text representation based on a selection of thematically central words (or multi-word units), so-called topic items
- Instrument for visualizing thematic development in text segments (intended as an analysis tool for linguists)
[Figure: paragraph 1 of a sample text; relation labels: hyponymy, meronymy]

Concept of topic views
- Thematic index, based on text-grammatical information, constructed from a selection of topic items
- Intended to support the user's orientation and navigation
[Figure: chapter texts (Chapter 1.1, Chapter 1.2) with topic items marked, next to an index listing topic items 1 to 3 under each chapter heading (Chapter 1.1, Chapter 1.2, Chapter 1.3, Chapter 2, Chapter 3.1, …)]

Example: Kinderarmut
- Newspaper article about child poverty in Germany
- Topic items according to the method used in the initial experiments: Kind (child), Geld (money), Deutschland (Germany)

Construction of topic chains and views
- output of lexical chaining + additional features ↕ net with topic items as nodes and semantic relatedness as edges
- use all topic items per paragraph for another chaining step → topic chain
- 1-3 best topic items per paragraph → topic view

Basis of our approach: GLexi
Our lexical chainer GLexi:
- modular architecture:
  - linguistic preprocessing, XML input
  - core algorithm with an interface for integrating various resources / relatedness measures
  - output generation in several formats (e.g. XML and visual graph representation)
- evaluation with respect to coverage, disambiguation quality, performance of semantic relatedness measures, and application (see Cramer & Finthammer, 2008)

Basis of our approach: GLexi
Principal parameter settings:
- preprocessing (using TEMIS tools): lemmatization, morphological analysis, POS tagging
- resources: GermaNet, GermaTermNet (extension of GermaNet with terminology), Google co-occurrence counts
- 11 semantic relatedness measures:
  - 8 based on GermaNet (or GermaTermNet): Graph-Path, Tree-Path, Wu-Palmer (1994), Leacock-Chodorow (1998), Hirst-St-Onge (1998), Resnik (1995), Jiang-Conrath (1997), Lin (1998)
  - 3 based on Google: Google-Quotient, Google-NDG, Google-PMI
Parameter setting for the construction of topic views / topic chains*:
- all preprocessing steps (using TEMIS tools)
- GermaTermNet
- Lin's measure (based on GermaTermNet)
- thresholds: 0.4 (not_related vs. related) and 0.7 (related vs. strongly related)
* decision based on experiments reported in Cramer & Finthammer, 2008
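
As an illustration of the setting above: Lin's (1998) measure scores two concepts as 2 · IC(lcs) / (IC(c1) + IC(c2)), where IC is information content and lcs is the lowest common subsumer of the two concepts. The Python sketch below assumes the IC values have already been computed elsewhere. The function names, and the choice to count a value lying exactly on a threshold toward the higher class, are illustrative assumptions and not GLexi's actual interface.

```python
def lin_relatedness(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Lin (1998): 2 * IC(lcs) / (IC(c1) + IC(c2)).

    Returns 0.0 when both information-content values are zero
    (a convention chosen here to avoid division by zero).
    """
    denominator = ic_c1 + ic_c2
    if denominator == 0:
        return 0.0
    return 2.0 * ic_lcs / denominator


# Thresholds taken from the slide.
RELATED_THRESHOLD = 0.4
STRONGLY_RELATED_THRESHOLD = 0.7


def relation_class(value: float) -> str:
    """Map a relatedness value to one of the three classes.

    Assumption: a value exactly on a threshold falls into the higher class.
    """
    if value >= STRONGLY_RELATED_THRESHOLD:
        return "strongly_related"
    if value >= RELATED_THRESHOLD:
        return "related"
    return "not_related"
```

With made-up IC values of 4.0 and 6.0 for the two concepts and 3.0 for their lowest common subsumer, the score is 6.0 / 10.0 = 0.6, which falls into the "related" class.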

Construction of topic views - overview
Intuition: a topic item is a lexical item central to the topic(s) of a paragraph
Automatic selection of topic items:
- select relevant lexical items per paragraph → topic item candidates (TICs)
- build a network with TICs as nodes and weighted edges based on GLexi
- remove edges with low relatedness values (according to our threshold)
- select topic items
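
The network-building and edge-pruning steps can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: `build_tic_net`, the callable `relatedness`, and the toy scores are hypothetical, and the default cut-off of 0.4 is the not_related/related threshold quoted in the parameter settings.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]


def build_tic_net(
    tics: List[str],
    relatedness: Callable[[str, str], float],
    threshold: float = 0.4,
) -> Dict[Edge, float]:
    """Build a weighted, undirected network over topic item candidates.

    Every pair of distinct TICs is scored; edges whose relatedness value
    is below the threshold are dropped. Edge keys are sorted pairs.
    """
    net: Dict[Edge, float] = {}
    for a, b in combinations(sorted(set(tics)), 2):
        weight = relatedness(a, b)
        if weight >= threshold:
            net[(a, b)] = weight
    return net


# Hypothetical relatedness scores, for illustration only.
TOY_SCORES = {
    ("Geld", "Kind"): 0.45,
    ("Deutschland", "Kind"): 0.10,
    ("Deutschland", "Geld"): 0.75,
}


def toy_relatedness(a: str, b: str) -> float:
    """Look up a symmetric toy score; unknown pairs score 0.0."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


net = build_tic_net(["Kind", "Geld", "Deutschland", "Kind"], toy_relatedness)
```

In this toy run the Deutschland/Kind edge (0.10) is removed, while the other two edges survive the pruning step.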

Construction of topic views - overview
Criteria for topic item selection:
- parameters:
  - relative frequency
  - density in the TIC net
  - relation strength in the TIC net
- use a linear combination of these parameters to calculate a topic relevance value for each TIC
- derive a ranking on the basis of the topic relevance values
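
A hedged sketch of such a linear combination is given below. The slides do not define the three parameters or the weights, so everything concrete here is an assumption made for illustration: relative frequency is a TIC's share of all TIC occurrences in the paragraph, density is the fraction of other TICs it is connected to, relation strength is the mean weight of its edges, and the weights default to 1.0 each.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def topic_relevance(
    occurrences: List[str],
    net: Dict[Edge, float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Dict[str, float]:
    """Score each TIC by a weighted sum of three (assumed) features."""
    w_freq, w_density, w_strength = weights
    counts = Counter(occurrences)
    total = sum(counts.values())
    nodes = sorted(counts)
    scores: Dict[str, float] = {}
    for node in nodes:
        # Weights of all edges that touch this node.
        incident = [w for (a, b), w in net.items() if node in (a, b)]
        rel_freq = counts[node] / total
        density = len(incident) / (len(nodes) - 1) if len(nodes) > 1 else 0.0
        strength = sum(incident) / len(incident) if incident else 0.0
        scores[node] = w_freq * rel_freq + w_density * density + w_strength * strength
    return scores


def rank(scores: Dict[str, float]) -> List[str]:
    """Order TICs by descending relevance; ties are broken alphabetically."""
    return sorted(scores, key=lambda node: (-scores[node], node))


# Hypothetical paragraph: four TIC occurrences and one surviving edge.
scores = topic_relevance(
    ["Kind", "Kind", "Geld", "Deutschland"],
    {("Geld", "Kind"): 0.8},
)
ranking = rank(scores)
```

In this toy paragraph "Kind" ranks first because it is both the most frequent candidate and connected in the net, while the isolated "Deutschland" ranks last.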

Construction of topic views – step-by-step
[Figure: two sample paragraphs of running text (paragraph 1, paragraph 2)]

Construction of topic views – step-by-step
[Figure: the sample paragraphs with topic item candidates marked; 10 tic (= topic item candidates)]

Construction of topic views – step-by-step
[Figure: the sample paragraphs with edges between topic item candidates; edge legend: related, strongly related, not related]

Construction of topic views – step-by-step
[Figure: the sample paragraphs with edges between topic item candidates; edge legend: related, strongly related]

Construction of topic views – step-by-step
[Figure: the sample paragraphs with topic item candidates marked]
Criteria for selection + ranking:
1. relative frequency
2. density
3. relation strength

Construction of topic chains and views
- use all topic items per paragraph for another chaining step → topic chain
- 1-3 best topic items per paragraph → topic view
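
The topic-view step can be sketched as taking the best-ranked items per text unit and printing them as an index. This Python sketch assumes a ranking per chapter or paragraph is already available; the function names, the indented output format, and the item "Staat" are hypothetical.

```python
from typing import Dict, List


def topic_view(ranked: Dict[str, List[str]], k: int = 3) -> Dict[str, List[str]]:
    """Keep at most the k best-ranked topic items per text unit.

    `ranked` maps a unit label (e.g. a chapter heading) to its topic items
    in descending order of relevance. Units with fewer than k items keep
    all of them.
    """
    return {unit: items[:k] for unit, items in ranked.items()}


def render_index(view: Dict[str, List[str]]) -> str:
    """Render the view as a plain-text index, one indented item per line."""
    lines: List[str] = []
    for unit, items in view.items():
        lines.append(unit)
        lines.extend(f"  {item}" for item in items)
    return "\n".join(lines)


# Hypothetical rankings; "Staat" is an invented fourth item.
view = topic_view({
    "Chapter 1.1": ["Kind", "Geld", "Deutschland", "Staat"],
    "Chapter 1.2": ["Geld"],
})
index_text = render_index(view)
```

The fourth item of Chapter 1.1 is cut off, and Chapter 1.2 keeps its single item, so each index entry carries between one and three topic items.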

Evaluation
- Manual annotation of topic items in part (80 paragraphs) of the HyTex core corpus (annotator agreement: approx. 70 %) → gold standard for evaluating the automatic extraction
- Automatic extraction of topic items in part (107 paragraphs) of the HyTex core corpus

Overlap with manual annotation (share of paragraphs):

                     1       ≤ 2/3 and ≥ 1/3   0       ≥ 1/3 (sum of first two columns)
Doc 1 (39 par.)      23 %    59 %              18 %    82 %
Doc 2 (49 par.)      18 %    53 %              29 %    71 %
Doc 3 (19 par.)      53 %    32 %              16 %    85 %
mean of all 3 docs   31 %    48 %              21 %    79 %
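
One way to compute such overlap classes is sketched below in Python. The slides do not say how overlap is measured, so this sketch assumes it is the share of automatically extracted topic items in a paragraph that also appear in the manual annotation; the function names, the bucket labels, and the example sets are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List, Set, Tuple


def overlap(auto: Set[str], gold: Set[str]) -> Fraction:
    """Assumed definition: |auto ∩ gold| / |auto| (0 if nothing was extracted)."""
    if not auto:
        return Fraction(0)
    return Fraction(len(auto & gold), len(auto))


def bucket(value: Fraction) -> str:
    """Assign an overlap value to one of the classes used in the table."""
    if value == 1:
        return "1"
    if value == 0:
        return "0"
    if Fraction(1, 3) <= value <= Fraction(2, 3):
        return "1/3..2/3"
    return "other"  # cannot occur when 1 to 3 items are extracted


def bucket_shares(pairs: List[Tuple[Set[str], Set[str]]]) -> Dict[str, float]:
    """Percentage of paragraphs per overlap class."""
    if not pairs:
        return {}
    counts = Counter(bucket(overlap(auto, gold)) for auto, gold in pairs)
    return {name: 100.0 * n / len(pairs) for name, n in counts.items()}


# Four hypothetical paragraphs: (automatic items, manually annotated items).
shares = bucket_shares([
    ({"Kind", "Geld"}, {"Kind", "Geld"}),
    ({"Kind", "Geld", "Deutschland"}, {"Kind", "Geld"}),
    ({"Deutschland"}, {"Kind"}),
    ({"Kind", "Deutschland"}, {"Kind"}),
])
```

Exact fractions are used so that boundary values such as 1/3 and 2/3 are classified without floating-point rounding issues.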

Outlook
Observations:
- if all relevant words in the paragraph are appropriately represented in the lexical semantic resource, then the performance of automatic topic item extraction is good
- the longer a paragraph, the better the extraction of topic items
Challenges in automatic topic item extraction:
- named entities
- technical terms
- multi-word preprocessing
Plans:
- integration of topic views as a new navigation tool into the HyTex demo prototype
- experiments on refining the manual annotation and the automatic extraction, in particular more features in TIC selection, such as mark-up and tf/idf methods for density and strength

Thank you! More information about our research can be found at our project web-pages: