Presentation transcript:

Which of the two appears simpler to you? [Figure: two example web pages, labelled 1 and 2.]

Search for a keyword. The results are sometimes irrelevant and come back in a mixed order of readability.

Our Objective: query → retrieve web pages (considering relevance) → re-rank the web pages based on readability. All of this is accomplished automatically.

An Unsupervised Technical Readability Ranking Model by Building a Conceptual Terrain in LSI. Shoaib Jameel, Xiaojun Qian – The Chinese University of Hong Kong. (This is me!)

What has been done so far? Heuristic readability formulae, unsupervised approaches, and supervised approaches. My focus in this talk is to cover some popular works in this area; an exhaustive list of references can be found in my paper.

Heuristic Readability Methods. These have been around since the 1940s. Semantic component – number of syllables per word, syllable length, etc. Syntactic component – sentence length, etc.

Example – Flesch Reading Ease: 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words). The syllables-per-word term is the semantic component, the words-per-sentence term is the syntactic component, and 206.835, 1.015, and 84.6 are the manually tuned numerical parameters.
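As a concrete illustration, a minimal Python sketch of the Flesch Reading Ease score; the syllable counter is a crude vowel-group heuristic rather than the dictionary-based count the original formula assumes.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate simpler text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(1, len(sentences))   # syntactic component
    syllables_per_word = syllables / max(1, len(words))        # semantic component
    # 206.835, 1.015 and 84.6 are the manually tuned parameters.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

print(flesch_reading_ease("The cat sat on the mat. It was happy."))
```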

Supervised Learning Methods: language models, SVMs (Support Vector Machines), and the use of query logs and user profiles.

Smoothed Unigram Model [1]. They recast the well-studied problem of readability in terms of text categorization and used straightforward techniques from statistical language modeling. [1] K. Collins-Thompson and J. Callan (2005). "Predicting reading difficulty with statistical language models." Journal of the American Society for Information Science and Technology, 56(13).
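A minimal sketch of the idea (not Collins-Thompson and Callan's exact model): one smoothed unigram language model per grade level, with a new document assigned the level whose model gives it the highest log-likelihood. The vocabulary, smoothing constant, and toy training data below are placeholders.

```python
import math
from collections import Counter

def train_unigram(docs, vocab, alpha=1.0):
    """Build an add-alpha smoothed unigram model from tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

def log_likelihood(doc, model):
    return sum(math.log(model[w]) for w in doc if w in model)

def predict_level(doc, level_models):
    """Return the grade level whose language model best explains the document."""
    return max(level_models, key=lambda lvl: log_likelihood(doc, level_models[lvl]))

# Toy illustration with two "grade levels".
vocab = {"the", "cat", "sat", "photosynthesis", "converts", "light", "energy"}
level_models = {
    "grade_3": train_unigram([["the", "cat", "sat"]], vocab),
    "grade_9": train_unigram([["photosynthesis", "converts", "light", "energy"]], vocab),
}
print(predict_level(["the", "cat", "sat", "sat"], level_models))  # -> grade_3
```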

Smoothed Unigram Model. Limitation of their method: it requires training data, which may sometimes be difficult to obtain.

Domain-specific Readability.
Jin Zhao and Min-Yen Kan. Domain-specific iterative readability computation. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10). Based on the web-link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis).
Xin Yan, Dawei Song, and Xue Li. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06). Based on an ontology; tested only in the medical domain. I will focus on this work.

Overview. The authors state that document scope and document cohesion are important parameters in finding simple texts. They use a controlled-vocabulary thesaurus, the Medical Subject Headings (MeSH). They also point out that readability formulae are not directly applicable to web pages.

MeSH Ontology. [Figure: the MeSH concept hierarchy, annotated with the directions in which concept difficulty increases and decreases.]

Overall Concept-Based Readability Score [formula shown on the slide], where:
DaCw = Dale-Chall readability measure
PWD = percentage of difficult words
AvgSL = average sentence length in d_i
len(c_i, c_j) = function to compute the shortest path between concepts c_i and c_j in the MeSH hierarchy
N = total number of domain concepts in document d_i
Depth(c_i) = depth of concept c_i in the concept hierarchy
D = maximum depth of the concept hierarchy
Their work focused on word-level readability, and hence considered only the PWD component.
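Since the scoring formula itself appears only as an image on the slide, here is a hedged sketch of how Depth(c) and len(c_i, c_j) could be computed on a toy concept hierarchy stored as a child-to-parent map; the real MeSH tree and the full score are in Yan et al. (CIKM '06).

```python
def ancestors(concept, parent):
    """Path from a concept up to the root, inclusive."""
    path = [concept]
    while concept in parent:
        concept = parent[concept]
        path.append(concept)
    return path

def depth(concept, parent):
    """Depth(c): distance from the root (the root has depth 0)."""
    return len(ancestors(concept, parent)) - 1

def path_len(c_i, c_j, parent):
    """len(c_i, c_j): shortest path between two concepts via their lowest common ancestor."""
    anc_i, anc_j = ancestors(c_i, parent), ancestors(c_j, parent)
    common = next(a for a in anc_i if a in anc_j)  # lowest common ancestor
    return anc_i.index(common) + anc_j.index(common)

# Toy child -> parent hierarchy (a stand-in for MeSH).
parent = {"neoplasms": "diseases", "carcinoma": "neoplasms", "lymphoma": "neoplasms"}
print(depth("carcinoma", parent))                 # 2
print(path_len("carcinoma", "lymphoma", parent))  # 2
```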

Our “Terrain-based” Method So, what’s the connection?

Latent Semantic Indexing. Core component – Singular Value Decomposition: SVD(C) = U S V^T, where C is the term-document matrix [shown as a matrix on the slide].

SVD(C) = U S V^T. [Figure: the decomposition of C into the matrices U, S, and V^T.]
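A minimal numpy sketch of LSI on a toy term-document matrix C: a truncated SVD gives low-rank term and document vectors in a shared latent space. The matrix values, the rank k, and the scaling by the singular values are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Toy term-document count matrix C (rows: terms, columns: documents).
C = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)  # C = U S V^T

k = 2  # number of latent dimensions to keep
term_vectors = U[:, :k] * s[:k]      # terms projected into the latent space
doc_vectors = Vt[:k, :].T * s[:k]    # documents projected into the latent space

print(term_vectors.shape, doc_vectors.shape)  # (4, 2) (4, 2)
```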

Three components. Term Centrality – Is the term central to the document's theme? Term Cohesion – Is the term closely related to the other terms in the document? Term Difficulty – Will the reader find it difficult to comprehend the meaning of the term?

Term Centrality – the closeness of the term vector to the document vector in the LSI latent space. [Figure: terms T1 (more central) and T2 (less central) relative to the document vector D in the LSI latent space.] Term Centrality = 1 / (Euclidean distance(T1, D) + small constant).
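A small sketch of the centrality score under the stated definition, assuming term and document vectors already projected into the LSI latent space; the value of the small constant is a placeholder.

```python
import numpy as np

def term_centrality(term_vec, doc_vec, eps=1e-6):
    """Inverse of the Euclidean distance between a term vector and its document vector."""
    return 1.0 / (np.linalg.norm(term_vec - doc_vec) + eps)

# A term close to the document vector is more central than a distant one.
doc = np.array([1.0, 0.5])
print(term_centrality(np.array([0.9, 0.6]), doc))   # large -> more central
print(term_centrality(np.array([-1.0, 2.0]), doc))  # small -> less central
```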

Term Cohesion – obtained by computing the Euclidean distance between two consecutive terms (e.g., T1 and T2) in the LSI latent space; normalization is done to standardize the values. [Figure: terms T1–T4 and the document vector D in the LSI latent space, with the distances between consecutive terms marked.]
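A correspondingly small sketch of the cohesion distances: Euclidean distances between consecutive term vectors, here min-max normalized per document (the exact normalization used in the paper is not spelled out on the slide).

```python
import numpy as np

def cohesion_distances(term_vectors):
    """Euclidean distances between consecutive term vectors, min-max normalized to [0, 1]."""
    d = np.array([np.linalg.norm(a - b)
                  for a, b in zip(term_vectors[:-1], term_vectors[1:])])
    span = d.max() - d.min()
    return (d - d.min()) / span if span > 0 else np.zeros_like(d)

terms = [np.array([0.9, 0.6]), np.array([1.1, 0.4]), np.array([-1.0, 2.0])]
print(cohesion_distances(terms))  # small value: cohesive pair; large value: a difficult "jump"
```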

Term Difficulty = Term Centrality × Inverse Document Frequency (idf). Idea of idf – if a term is used less often in the document collection, it should be regarded as important; for example, "proton" will not occur very often, but it is an important term.
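A sketch of the difficulty score as stated on the slide: centrality multiplied by inverse document frequency. The particular idf variant (log of N over document frequency) is an assumption, and term_centrality is the helper sketched above.

```python
import math

def idf(term, doc_freq, num_docs):
    """Standard inverse document frequency: rarer terms get higher weight."""
    return math.log(num_docs / doc_freq[term])

def term_difficulty(centrality, term, doc_freq, num_docs):
    """Term difficulty = term centrality x idf, as defined on the slide."""
    return centrality * idf(term, doc_freq, num_docs)

doc_freq = {"proton": 12, "water": 900}
print(term_difficulty(0.8, "proton", doc_freq, num_docs=1000))  # rare term -> more difficult
print(term_difficulty(0.8, "water", doc_freq, num_docs=1000))   # common term -> easier
```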

So, what do we now obtain? Term difficulty and term cohesion: a reader now has to hop from one term to the next in the LSI latent space. [Figure: the resulting conceptual "terrain", with terms as points to hop between.]

How is ranking done? Keep aggregating the individual transition scores; finally, we obtain a single real number, which is used for ranking.
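A hedged sketch of this aggregation step: walk over the document's terms in order, treat each hop as a transition whose cost combines the cohesion distance with the difficulty of the term being reached, and sum the costs into one score used for ranking. How the two quantities are actually combined is not given on the slide, so the simple addition below is an assumption.

```python
import numpy as np

def terrain_score(term_vectors, difficulties):
    """Aggregate transition scores along the sequence of a document's terms.

    term_vectors: latent-space vectors of the document's terms, in reading order.
    difficulties: difficulty score of each term (centrality x idf).
    Lower scores are assumed to indicate easier documents.
    """
    score = 0.0
    for i in range(1, len(term_vectors)):
        hop = np.linalg.norm(term_vectors[i] - term_vectors[i - 1])  # cohesion distance
        score += hop + difficulties[i]  # assumed combination of the two components
    return score

def rank_by_readability(docs):
    """docs: list of (doc_id, term_vectors, difficulties); easiest documents first."""
    return sorted(docs, key=lambda d: terrain_score(d[1], d[2]))
```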

Experiments and Results. Collect web pages from domain-specific sites, such as Science and Psychology websites, and test in these two domains. Procedure: retrieve relevant web pages given a query, annotate the top ten web pages, and re-rank the search results based on readability. NDCG is used as the evaluation metric.
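For reference, a standard NDCG@k implementation of the kind used for this evaluation, applied to graded readability labels of the top-ranked pages; the gain and discount follow the common log2 formulation, and the paper's exact variant may differ.

```python
import math

def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    """NDCG@k for a ranked list of graded relevance (readability) labels."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Readability grades of the top five results as returned by the system.
print(ndcg([3, 2, 3, 0, 1], k=5))
```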

Results - Psychology

Results - Science

END