Yes, user! Compiling a corpus according to what the user wants
Information and Communication Technologies, Linguateca; University of São Paulo, ICMC / NILC

Presentation transcript:

Slide 1: Yes, user! Compiling a corpus according to what the user wants
Rachel Aires, Diana Santos, Sandra Aluísio
Linguateca & USP

Slide 2: Linguateca, a project for Portuguese
- A distributed resource center for Portuguese language technology
- First node at SINTEF ICT, Oslo, started in 2000 (work at SINTEF started in 1998 as the Computational Processing of Portuguese project)
- IRE model: Information, Resources, Evaluation
- Nodes: Oslo, Lisboa XLDB, Braga, Porto, Lisboa LabEL, Odense, São Carlos, Lisboa COMPARA

Slide 3: Structure of the presentation
- The corpus at a glance
- The general motivation for the study that led to its creation
- The hypotheses of the study
- Empirical assessment of the hypotheses
- User-based assessment of the hypotheses
- Further work (in progress)
- Related studies

Slide 4: Yes, user!
- A corpus of 1,703 Brazilian Web pages classified according to users' goals into seven different (but not necessarily mutually exclusive) kinds
- Publicly available from [URL not preserved in the transcript]
- Plus: several domain-specific binary corpora in Portuguese of 200 Web pages each (100 positive, 100 negative), developed independently by several users (native speakers of Portuguese)
- Domains: Japanese cooking; fado; aviation; law; philosophy of language; fishing; surrealism; history; etc.
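The "not necessarily mutually exclusive" classification can be pictured as a multi-label structure. The following is only a minimal sketch: the label names `need_1` to `need_7`, the `Page` class, and the `binary_view` helper are placeholders invented for illustration, since the slide does not name the seven kinds or describe the corpus file format.

```python
from dataclasses import dataclass, field

# Placeholder labels: the slide does not name the seven kinds.
NEEDS = tuple(f"need_{i}" for i in range(1, 8))

@dataclass
class Page:
    url: str
    # A page may satisfy several needs at once (multi-label).
    needs: frozenset = field(default_factory=frozenset)

def binary_view(pages, need):
    """Split a multi-label corpus into (positive, negative) pages for one need."""
    positive = [p for p in pages if need in p.needs]
    negative = [p for p in pages if need not in p.needs]
    return positive, negative

# Invented example pages, not taken from the corpus.
pages = [
    Page("http://example.org/a", frozenset({"need_1", "need_3"})),
    Page("http://example.org/b", frozenset({"need_3"})),
    Page("http://example.org/c", frozenset({"need_7"})),
]
pos, neg = binary_view(pages, "need_3")
```

The same positive/negative split is the shape of the user-built binary corpora (100 positive and 100 negative pages each).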

Slide 5: Problem setting: NLP in IR
- Or, more specifically: Portuguese processing for IR in Portuguese
- The general question is: does it help?
- Of the many questions and subproblems that could be chosen, we (in Rachel's PhD studies) concentrated on a specific one: can NLP improve the presentation of the results to the user of a Web search engine?
- In the usual tripartite IR model, there are three phases to which NLP could be added: search, indexing and presentation. The last one has been the least studied as regards the contribution of NLP.

Slide 6: Problem setting: shortcomings of Web IR
- The problem is no longer just to find results, but also to navigate among their massive number in a meaningful way
- Web users are extremely varied and use the Web for all kinds of purposes and activities, so standard user testing in a usability lab with a sample of the population is out of the question
- These two factors have led researchers to:
  - investigate properties of Web documents other than their content (such as style, language quality, text genre, authority, novelty, maintenance/update rate)
  - offer personalization solutions for specific kinds of search, for specific kinds of users, and for specific kinds of content
  - try to study the users and their behaviour using unobtrusive techniques
- The work reported here is in this line

Slide 7: The main hypotheses
1. It is possible to distinguish among different users' needs
2. In general, different pages satisfy different users' needs
3. It is possible in a majority of cases to evaluate whether a particular page or site was designed to satisfy (or not) a given user's need
4. Users are aware of their needs (if asked to distinguish among several)
5. It is possible to automatically classify pages according to users' needs
6. Users have higher satisfaction when getting results pages clustered around users' needs
Slide legend: technical/philosophical hypotheses; user-related hypotheses

Slide 8: The technical hypotheses
- The first three have received some confirmation from the compilation of the Yes, user! corpus:
  - a typology of seven user needs was devised after careful qualitative consideration of the logs of a real Web search engine (Todobr)
  - four different people looked at Web pages and classified them according to a set of precise guidelines with examples: although there was considerable overlap (many pages satisfied more than one need), no page satisfied every need at once
- The fifth hypothesis was put to the test by developing automatic classifiers for texts satisfying each user need and testing them with ten-fold cross-validation on this corpus:
  - promising preliminary results in Aires et al. (2004) with a smaller corpus
  - precision > 70% in the present corpus (Aires et al., SIGIR style workshop 2005)
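The evaluation procedure named above, ten-fold cross-validation scored by precision, can be sketched as follows. This is a generic illustration and not the authors' setup: the threshold classifier, the single numeric feature, and the synthetic data are stand-ins assumed for the example, and the slides do not say which learner or features produced the reported figures.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def precision(predicted, gold):
    """Fraction of items predicted positive that are truly positive."""
    gold_of_predicted = [g for p, g in zip(predicted, gold) if p]
    if not gold_of_predicted:
        return 0.0  # convention chosen here for "nothing predicted positive"
    return sum(gold_of_predicted) / len(gold_of_predicted)

def cross_validate(features, labels, train, k=10):
    """Train on k-1 folds, test on the held-out fold, average the precision."""
    scores = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        predicted = [model(features[i]) for i in held_out]
        scores.append(precision(predicted, [labels[i] for i in held_out]))
    return sum(scores) / len(scores)

def train_threshold(xs, ys):
    """Toy stand-in classifier: cut midway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > cut

# Synthetic one-feature data, not drawn from the corpus.
features = [float(i) for i in range(50)] + [float(i) for i in range(100, 150)]
labels = [False] * 50 + [True] * 50
score = cross_validate(features, labels, train_threshold)
```

With one binary classifier per user need, the same loop would be run once per need and the per-need precision reported separately.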

Slide 9: The user-related hypotheses
- Are being tested in two different ways:
  - Questionnaire to prospective users:
    - devising it and delivering it to graduate students
    - analysing the 63 answers
  - User testing a prototype of a desktop meta searcher, Leva-e-Traz, to see whether users understand the concepts and whether their satisfaction with the search improves:
    - user testing is in progress at the moment
    - the prototype is working
- Before we show a screenshot of it, though, we have to say a little more:

Slide 10: Other questions we wanted to investigate
- Would traditional genres (as used e.g. by Karlgren or Stamatatos et al.) be more understandable to users?
- Or would plain kinds of texts, in the sense of traditional names like "novel", "biography", "scientific paper", "textbook", etc., be more in agreement with what the user wants?
  - Aluísio et al. (2003) had developed a genre typology, which also included "text types", for Brazilian Portuguese Web texts, for use in the Lácio-Web corpus, and we had also been able to develop automatic classifiers for them
- Would users spend half a day creating binary corpora in order to improve their search results in the future?
- Or were they satisfied with the performance of current Web search engines?
- So we just asked users

Slide 11: Leva-e-traz, desktop meta searcher prototype
(screenshot not reproduced)

Slide 12: Related work
- We were inspired by Karlgren's (2000) studies, and this work can be seen as a validation and confirmation of his original hypotheses, for Portuguese
- Other people have investigated automatic genre classification for other languages, such as Stamatatos et al. (2000) for Modern Greek
- We were inspired by Biber's (1988) empirical investigation of co-occurrence patterns among linguistic features, and, in fact, most of our features were directly inspired by (adapted from) those used by Biber
- There is a huge effort in personalization and user adaptation. Our work on binary corpora can be related to the adaptive IE system of Ciravegna & Wilks (2003)
- Finally, previous work on detecting users' goals (Aires & Aluísio, 2002), on proposing evaluation contests to characterize the Web in Portuguese (Aires et al., 2003), and on measuring its size (Aires & Santos, 2002) is also relevant

Slide 13: Further work
- We have parsed Yes, user! with the PALAVRAS parser (Bick, 2000), in order to investigate other relevant features for the classification into seven user needs. Although we do not expect to have a parser quick enough to do this in real time in the near future, this is a relevant question to investigate
- If our seven-needs classification proves useful, we may later add it to the indexing phase of a search engine, and in that case one might consider parsing the contents of the pages before indexing
- We are also considering adding further structural clues (such as those provided by HTML, and the presence or absence of images) to our classifiers in order to improve classification precision
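As an illustration of what "structural clues" from HTML might look like, the sketch below counts a few tags with Python's standard `html.parser` module. The particular tags counted and the `has_images` flag are assumptions made for the example; the slides do not specify which structural features were under consideration.

```python
from html.parser import HTMLParser

class StructuralClues(HTMLParser):
    """Count a few HTML-level signals; this feature set is illustrative only."""

    def __init__(self):
        super().__init__()
        self.counts = {"img": 0, "a": 0, "form": 0, "table": 0}

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        if tag in self.counts:
            self.counts[tag] += 1

def structural_features(html):
    """Return tag counts plus a presence/absence flag for images."""
    parser = StructuralClues()
    parser.feed(html)
    parser.close()
    features = dict(parser.counts)
    features["has_images"] = parser.counts["img"] > 0
    return features

sample = ('<html><body><img src="x.png">'
          '<a href="/a">a</a><a href="/b">b</a></body></html>')
features = structural_features(sample)
```

Features of this kind are cheap to compute at crawl time, unlike full parsing, so they could be appended to a classifier's feature vector without the real-time cost noted above.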

Slide 14: Concluding remarks
- It is feasible to classify Web pages according to the kinds of goals users have, as well as according to genre or text-type dimensions
- We have created Yes, user!, a corpus in Brazilian Portuguese classified according to these criteria, which we make available for other researchers to use and improve
- We have also created a set of small corpora, also available, in order to test adaptation to users' specific interests
- We are currently testing whether these kinds of classification can be put to advantage in users' real lives, using a meta searcher that presents results classified according to these parameters, and through user testing
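Presenting results "classified according to these parameters" amounts to grouping a result list by predicted label. The sketch below shows one way this could be done; it is not a description of Leva-e-Traz. The result fields, the labels `need_a` and `need_b`, and the keyword rules in `toy_classifier` are all hypothetical.

```python
from collections import defaultdict

def group_by_need(results, classify):
    """Group result URLs under every need label the classifier assigns."""
    groups = defaultdict(list)
    for result in results:
        # A result may carry several labels, or none at all.
        labels = classify(result) or ["unclassified"]
        for label in labels:
            groups[label].append(result["url"])
    return dict(groups)

def toy_classifier(result):
    """Hypothetical keyword rules standing in for the trained classifiers."""
    text = result["snippet"].lower()
    labels = []
    if "price" in text:
        labels.append("need_a")
    if "definition" in text:
        labels.append("need_b")
    return labels

# Invented search results for illustration.
results = [
    {"url": "http://example.org/1", "snippet": "Price list and definition of terms"},
    {"url": "http://example.org/2", "snippet": "Definition of fado"},
    {"url": "http://example.org/3", "snippet": "Photo gallery"},
]
grouped = group_by_need(results, toy_classifier)
```

Because the needs are not mutually exclusive, a page can appear under more than one group, as the first result does here.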