A Suite to Compile and Analyze an LSP Corpus

Presentation transcript:

A Suite to Compile and Analyze an LSP Corpus
6th International Conference on Language Resources and Evaluation, LREC 2008
Rogelio Nazar – Jorge Vivaldi – M. Teresa Cabré
{rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu

Introduction
This system (JAGUAR) is a set of tools for compiling and exploring an LSP corpus from the web: http://jaguar.iula.upf.edu
Usage examples:
- Terminology extraction
- Bilingual lexicon extraction
- Neologism extraction
Architecture: the system is divided into two main modules:
- Compilation of an LSP corpus from the web
- Analysis of the corpus with statistical techniques

Module 1: Compilation of an LSP corpus from the web
Document retrieval by querying search engines.
Classification of the collection on the basis of two axes:
- Degree of relevance to the topic
  - Possibility of corpus tuning with user feedback
- Degree of specialization of the document
  - Structure of the document (abstract, introduction, etc.)
  - System for bibliographical references, etc.
The final classification is the result of combining these factors.

Module 1: Compilation of an LSP corpus from the web Classification by degree of relevance to the topic:

Module 1: Compilation of an LSP corpus from the web
Classification by degree of relevance to the topic: co-occurrence graphs
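The slides do not describe how JAGUAR builds its co-occurrence graphs, so the following is only a minimal sketch of the general idea. It assumes whitespace tokenization and document-level co-occurrence; the function name `cooccurrence_graph`, the `min_weight` parameter, and the sample documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents, min_weight=1):
    """Return {(term_a, term_b): weight}, where weight is the number of
    documents in which both terms appear (assumed definition)."""
    edges = Counter()
    for text in documents:
        # Each term counts once per document; sorting makes pairs canonical.
        terms = sorted(set(text.lower().split()))
        edges.update(combinations(terms, 2))
    # Keep only edges seen in at least `min_weight` documents.
    return {pair: w for pair, w in edges.items() if w >= min_weight}


# Hypothetical toy collection, not data from the presentation.
docs = [
    "spastic diplegia cerebral palsy",
    "cerebral palsy treatment",
    "spastic diplegia treatment",
]
graph = cooccurrence_graph(docs, min_weight=2)
print(graph)
```

In a graph like this, terms that are strongly connected to the query term could serve as evidence that a document is on topic, although the actual weighting used by the system may differ.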

Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification: cumulative precision in the ranking of documents retrieved with the term "spastic diplegia".
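Cumulative precision over a ranking can be computed as the fraction of relevant documents among the top k, for each rank k. The sketch below assumes this standard definition; the function name and the relevance judgments are illustrative and are not taken from the experiment.

```python
def cumulative_precision(ranked_relevance):
    """Given relevance judgments ordered from best-ranked to worst-ranked,
    return the precision observed at each rank position."""
    precisions = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        precisions.append(hits / rank)
    return precisions


# Hypothetical judgments for a ranking of four documents.
judgments = [True, True, False, True]
curve = cumulative_precision(judgments)
print(curve)
```

A ranking that places relevant documents first yields a curve that starts at 1.0 and declines slowly, which is the shape one would hope to see in a plot of this kind.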

Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification: precision and recall for the experiments.
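For reference, precision and recall in their usual set-based form can be sketched as follows. The document identifiers are made up for illustration and the function name is hypothetical.

```python
def precision_recall(selected, relevant):
    """Precision: share of selected documents that are relevant.
    Recall: share of relevant documents that were selected."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the classifier keeps four documents, three are relevant overall.
p, r = precision_recall(
    selected={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(p, r)
```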

Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification: probability distribution of precision as a random variable (performance of 10,000 random classifiers).
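A random-classifier baseline of this kind can be approximated by simulation: draw many random selections of documents and record the precision of each. The sketch below assumes each random classifier picks a fixed number of documents uniformly at random; the collection size, number of relevant documents, and selection size are hypothetical values, not figures from the experiment.

```python
import random


def random_precision_distribution(n_docs, n_relevant, n_selected,
                                  trials=10_000, seed=0):
    """Simulate `trials` random classifiers, each selecting `n_selected`
    documents uniformly at random, and return the precision of each."""
    rng = random.Random(seed)
    relevant = set(range(n_relevant))  # treat the first n_relevant ids as relevant
    results = []
    for _ in range(trials):
        chosen = rng.sample(range(n_docs), n_selected)
        hits = sum(1 for doc_id in chosen if doc_id in relevant)
        results.append(hits / n_selected)
    return results


# Hypothetical setting: 100 documents, 30 relevant, 20 selected per classifier.
precisions = random_precision_distribution(n_docs=100, n_relevant=30, n_selected=20)
mean_precision = sum(precisions) / len(precisions)
print(round(mean_precision, 3))
```

The mean of the simulated distribution should sit close to the proportion of relevant documents in the collection, so a classifier whose precision lies far in the upper tail is unlikely to be performing at chance level.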

Module 2: Analysis of the corpus with statistical techniques
1. Input: from Module 1 or from a user-compiled corpus
2. Main functions:
- Measures of vocabulary richness
- Analysis of sample representativeness
- Automatic language recognition
- KWIC search
- N-gram extraction and sorting
- Collocation extraction
- Measures of association
- Models of term distribution
- Coefficients for vector comparison
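Several of the functions listed above have simple textbook forms, sketched below. These are generic versions and not JAGUAR's own implementations: the type-token ratio stands in for "vocabulary richness" and pointwise mutual information for "measures of association", both as assumptions, and all function names and the sample sentence are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (one richness measure)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def kwic(tokens, keyword, window=2):
    """Keyword-in-context: (left context, keyword, right context) per hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2), base 2."""
    if len(tokens) < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(ngrams(tokens, 2))
    joint = bigrams[(w1, w2)] / (len(tokens) - 1)
    if joint == 0:
        return float("-inf")
    p1 = unigrams[w1] / len(tokens)
    p2 = unigrams[w2] / len(tokens)
    return math.log2(joint / (p1 * p2))


# Hypothetical sample text, tokenized on whitespace.
tokens = ("spastic diplegia is a form of cerebral palsy "
          "and spastic diplegia affects the legs").split()
top_bigram = Counter(ngrams(tokens, 2)).most_common(1)[0]
ttr = type_token_ratio(tokens)
concordance = kwic(tokens, "diplegia")
association = pmi(tokens, "spastic", "diplegia")
print(top_bigram, ttr, concordance, association)
```

Sorting the bigram counts gives the "N-gram extraction and sorting" view, while a high PMI for a frequent pair is one common signal that the pair is a collocation or term candidate.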

http://rc16.upf.es/jaguar

Conclusions
We have presented the JAGUAR system, a set of tools for compiling and exploring an LSP corpus from the web.
The main characteristics of this suite are the following:
- It is able to collect an LSP corpus from the web, ensuring thematic adequacy and the degree of specialization for a given domain
- It offers tools to explore such a collection statistically through a friendly interface
- It has also been conceived as a library
- The original algorithms have been successfully evaluated
- Its use saves time and effort in the analysis of a corpus and also offers new insights, a perspective on the data that is invisible to the naked eye

Future Work
The project is now growing in different directions:
- Progressive enhancement with new functions and algorithms
- Conversion into a desktop application

Thanks!