
1 Readability: An application in Information Retrieval Lorna Kane Intelligent Information Retrieval Group, School of Computer Science and Informatics, University College Dublin

2 Traditional IR
- The core aim in Information Retrieval is to match a user's information need to the most relevant documents
- Traditionally, IR systems have concentrated on topical relevance
- For example, the query "HIV AIDS" should return documents that treat the topic of HIV and the AIDS virus
- IR systems are now doing a good job at finding topically relevant items

3 Traditional IR system
(Diagram: the user's information need is formulated as a query and processed; the document collection is indexed into document representations (files / indexes); system functions perform topical matching, e.g. by term frequency, and the user assesses the relevance of the returned documents.)

4 So… why is it hard?
- Semantics
  Bank of Ireland, bank note, bank (flight manoeuvre), bank (of a river). Semantics of an image? Of video? Of music?
- Natural Language
  Paris is the capital of France. Bordeaux is the wine-making capital of France.
- Opinion
  George Bush is honest. Geri Halliwell is talented. Big Brother is interesting. Van Gogh's self-portrait represents happiness. Talented? Interesting? Honest?
- Context
  The user's context will influence how they formulate their query
- Information Gap
  The user may be unable to unambiguously articulate their information need, or may under-specify the query

5 Document and Query Processing
1. Begin with natural language
2. Term normalisation: remove case, remove punctuation, alphabetise
   Input:  "Twinkle, twinkle, little bat. How I wonder what you're at? Up above the world you fly, like a tea-tray in the sky."
   Output: a above at bat fly how like little sky tea the tray twinkle up what wonder world you youre
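The normalisation step above can be sketched in a few lines of Python (an illustrative sketch, not the presenter's actual code; the apostrophe is deleted rather than split so that "you're" conflates to "youre" as on the slide):

```python
import re

def normalise(text):
    """Lowercase, drop apostrophes, replace remaining punctuation with
    spaces, split into terms and sort alphabetically."""
    text = text.lower().replace("'", "")
    text = re.sub(r"[^a-z\s]", " ", text)   # punctuation/digits -> spaces
    return sorted(text.split())

verse = ("Twinkle, twinkle, little bat. How I wonder what you're at? "
         "Up above the world you fly, like a tea-tray in the sky.")
print(normalise(verse))
```

Note that hyphenated words such as "tea-tray" come out as two terms, matching the slide's list.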

6 Document and Query Processing (contd.)
3. Stopword removal
- Remove words that carry little meaning, such as connectives, articles and prepositions
- Words in English follow a Zipf distribution, i.e. a few words appear very frequently, a medium number of words appear with medium frequency, and many words appear infrequently
- High-frequency words are useless because they describe too many objects
- Very low-frequency words may be too rare to be of value
  Before (25 words): a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre
  After (11 words): twinkle twinkle little bat wonder world fly like tea tray sky
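Stopword removal is a simple set-membership filter. A minimal sketch, using a small illustrative stopword list (real systems use curated lists of a few hundred words):

```python
# A tiny illustrative stopword list; production lists are much larger.
STOPWORDS = {"a", "above", "at", "how", "i", "in", "the",
             "up", "what", "you", "youre"}

def remove_stopwords(terms):
    """Drop high-frequency function words that carry little topical meaning."""
    return [t for t in terms if t not in STOPWORDS]

terms = ["a", "above", "at", "bat", "fly", "how", "like", "little", "sky",
         "tea", "the", "tray", "twinkle", "twinkle", "up", "what", "wonder",
         "world", "you", "youre"]
print(remove_stopwords(terms))   # 11 content words remain
```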

7 Document and Query Processing (contd.)
4. Stemming
- Reduction of morphological variants of a word to a common stem
- Will generate some errors, but reduces the size of index files and provides a way to find variants of search terms
- Over-stemming: organisation  organ
- Automatic conflation, e.g. the Porter stemming algorithm: create, creative, creation, creating  creat
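The slide cites the Porter algorithm, which runs a sequence of context-sensitive suffix-rewriting rules. A full implementation is long, so here is only a crude suffix-stripping sketch in the same spirit (not the real Porter algorithm), enough to conflate the slide's example words:

```python
def crude_stem(word):
    """Very simplified suffix stripping in the spirit of Porter stemming.
    Illustrative only: strips the first matching suffix, provided a stem
    of at least four letters remains."""
    for suffix in ("ational", "ation", "ing", "ive", "ion", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ("create", "creative", "creation", "creating"):
    print(w, "->", crude_stem(w))   # all conflate to "creat"
```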

8 Document and Query Processing (contd.)
5. Term weighting
TF (term frequency)
- Within a single document
- Gives high values for frequent terms
- E.g. our document is mostly about "twinkl" because that term occurs most frequently
- tf_di = number of occurrences of term i in document d
IDF (inverse document frequency)
- Computed over the whole document collection
- Gives high values for infrequent terms
- E.g. in a collection of medical articles the word "pathology" will occur in most documents and therefore does not distinguish documents
- idf_i = log(N / df_i), where N = number of documents in the collection and df_i = number of documents that contain term i
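Both quantities are straightforward to compute. A minimal sketch (the slide does not fix the logarithm base; natural log is used here, and the toy documents are invented for illustration):

```python
import math

def tf(term, doc_terms):
    """Raw term frequency of `term` within one document (a list of terms)."""
    return doc_terms.count(term)

def idf(term, collection):
    """Inverse document frequency: log(N / df) over the whole collection."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

docs = [
    ["twinkle", "twinkle", "little", "bat"],
    ["little", "star"],
    ["tea", "tray", "sky"],
]
print(tf("twinkle", docs[0]))   # occurs twice in the first document
print(idf("twinkle", docs))     # rare term: appears in 1 of 3 documents
print(idf("little", docs))      # more common term: lower idf
```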

9 Document and Query Processing (contd.)
6. We can combine tf and idf to get a term weight for each document: weight_di = tf_di * idf_i
7. Document matrix: each document / query is a vector of term weights

         Twinkl   Littl    Bat      Tea
  Doc 1  0.068    0.02     0.02     0.01
  Doc 2  0        0        0        0.001
  Doc 3  0        0.022    0.11     0.002
  Doc 4  0.042    0        0.04     0
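Putting steps 5 and 6 together, the whole document matrix can be built in one pass. A self-contained sketch over invented toy documents (natural log, raw term counts; real systems typically normalise by document length):

```python
import math

def tfidf_matrix(docs, vocab):
    """Build a document-by-term matrix of tf * idf weights."""
    n = len(docs)
    matrix = []
    for doc in docs:
        row = []
        for term in vocab:
            tf = doc.count(term)                       # term frequency
            df = sum(1 for d in docs if term in d)     # document frequency
            idf = math.log(n / df) if df else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix

docs = [["twinkl", "twinkl", "littl", "bat"],
        ["tea", "tray"],
        ["littl", "tea", "bat"]]
vocab = ["twinkl", "littl", "bat", "tea"]
for row in tfidf_matrix(docs, vocab):
    print(row)
```

"twinkl" gets the highest weight in the first document: it is both frequent there and rare in the collection.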

10 Vector Space Model
- Document and query matrices are represented as vectors in n-dimensional space (n = number of unique words in the collection)
(Diagram: Doc 1, Doc 2, Doc 3 and the Query plotted as vectors over the axes term 1 … term 5)

11 Finally, Retrieval
- The closer a query is to a document, the better the query matches that document
- A ranked list of topically relevant documents is computed using distance metrics and returned to the user
(Diagram: the same vector space, with the query vector plotted among the document vectors)
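A common closeness measure in the vector space model is cosine similarity, i.e. ranking documents by the angle between their weight vector and the query's. A minimal sketch with hypothetical tf*idf vectors (the weights are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical tf*idf weight vectors over the same 4-term vocabulary.
query = [1.0, 0.0, 0.5, 0.0]
docs = {"doc1": [0.9, 0.1, 0.4, 0.0],
        "doc2": [0.0, 0.8, 0.0, 0.7],
        "doc3": [0.2, 0.0, 0.9, 0.1]}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)   # ['doc1', 'doc3', 'doc2']
```

doc1 ranks first because it shares the query's dominant terms; doc2 shares none and scores zero.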

12 The Relevance Melting Pot
- However! Relevance has been shown to be a multi-faceted concept.
- The relevance of a document to a given query is influenced by the user's context.
- The user judges relevance by a number of criteria aside from topic

13 Readability as a relevance criterion
- Relevance criteria listed in various user studies…
Cool et al. (1993)
- Topic: deep / superficial
- Content: explanation, level of detail
- Presentation: understandability, simplicity / complexity, technicality
Barry (1998)
- The user's judgement that he/she will be able to understand or follow the information presented
- The extent to which information is presented in a clear or readable manner
- The extent to which the information presented is novel to the user
Schamber (1998)
- Information is specific to the user's need; has sufficient detail or depth
- Information is presented clearly, with little effort to read or understand

14 Relevance in Context
- So… we can conclude that users want documents that they can understand and that have the right amount of detail, as well as being topically relevant.
- For example, someone who knows very little about AIDS may not be able to understand the following excerpt, i.e. it is irrelevant in their context:
"The development of OIs during HIV disease not only indicates the degree of immunosuppression, but may also influence disease progression itself. When stratified by CD4 counts, patients with prior histories of OIs have higher mortality rates than those without prior histories of OIs"

15 Zones of Learnability
- This follows Walter Kintsch's 1994 "zones of learnability" hypothesis:
"If a student's knowledge overlaps too much with an instructional text, there is simply not enough for the student to learn from the text. If there is no overlap, or almost no overlap, there can be no learning either: the necessary hooks in the students' knowledge, onto which new information is hung, are missing."
- As such, an IR system should try to match a user with a given level of domain knowledge to documents that they can learn the most from: documents that have the optimum balance of redundant and new information

16 How can we achieve such a match?
- The ideas presented thus far relate closely to the concept of readability
- Readability is a characteristic of text documents:
"the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting." (Dale & Chall, 1949)
"ease of understanding or comprehension due to the style of writing" (Klare, 1963)

17 How can we measure Readability?
- A number of traditional readability formulas use simplistic measures:
  Sentence length
  Word frequency lists
  Number of syllables
- Common formulas: Flesch-Kincaid, Dale-Chall, Gunning-Fog
- They were designed to categorise educational texts by grade level and to help authors write for a target audience
- These formulas have been criticised because they only measure surface-level statistics
- E.g. the word "quark" has only one syllable but is a difficult concept to comprehend
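To make the "simplistic measures" point concrete, here is the classic Flesch Reading Ease score with a rough vowel-group syllable counter (the syllable heuristic is an assumption of this sketch; published implementations use more careful counters):

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat."))
print(flesch_reading_ease("Patients with prior histories of opportunistic "
                          "infections have higher mortality rates."))
```

The formula happily scores "quark" as easy, illustrating exactly the criticism on the slide: it sees only surface statistics, not conceptual difficulty.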

18 How can we measure Readability?
- Readability encompasses a number of document characteristics…
Legibility of the text
- How physically readable the text is, i.e. font size, paper colour, bullet points, graphical representations (treatment of legibility is out of scope of the thesis – for now anyway!)
Syntactic complexity of the text
- Grammatical arrangement of words within a sentence (e.g. active / passive sentences have been shown to affect readability)
- Simple, compound and complex sentences
Organisation of the text
- Rhetorical structure: the function of statements in the text, e.g. evidence, antithesis. For example, the word "but" or the phrase "on the other hand" can signal an antithesis to a previous statement.
- Textual cohesion: logical linkage between textual units, as indicated by overt formal markers of the relations between texts. "Trees are green and have leaves. When many grow in the same place they make up a forest." The words "many" and "they" refer to "trees" in the first sentence, thus making the two sentences cohesive.
Semantic complexity of the text
- The difficulty of the concepts/ideas represented in the text
- Abstractness / concreteness of the concepts represented in the text

19 Does readability exist in a vacuum?
- Document characteristics: legibility, syntactic complexity, semantic complexity, organisation (rhetorical structure, coherence)
- User characteristics: domain knowledge, reading ability, learning style, motivation, task
- Readability arises from the INTERACTION of the two!

20 Using Readability to improve relevance
(Pipeline: QUERY → IR SYSTEM → TOPICALLY RELEVANT SET → FEATURE EXTRACTION (syntactic complexity, semantic complexity, organisation: rhetorical structure, coherence) → READABILITY CLASSIFIER → RERANK, guided by inferences about the user's readability preference → CONTEXTUALLY & TOPICALLY RELEVANT SET)

21 Feature Extraction: Syntactic Complexity
- Syntactic complexity is operationalised by measuring POS statistics and the natural language parse tree
- This tells us the function of words in a sentence and the complexity of the sentence
- Example parse of "Susan hit Michael":
  (SENTENCE (NOUN PHRASE (PROPER NOUN Susan)) (VERB PHRASE (VERB hit) (NOUN PHRASE (PROPER NOUN Michael))))
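One simple parse-tree statistic is tree depth: deeper trees tend to indicate more syntactically complex sentences. A sketch with the slide's sentence as a hand-built nested tuple (real systems would obtain the tree from a parser):

```python
def depth(tree):
    """Depth of a nested (label, children...) parse tree;
    a bare string is a leaf word at depth 0."""
    if isinstance(tree, str):
        return 0
    label, *children = tree
    return 1 + max(depth(c) for c in children)

# "Susan hit Michael" as a hand-built constituency tree.
tree = ("S",
        ("NP", ("PN", "Susan")),
        ("VP", ("V", "hit"),
               ("NP", ("PN", "Michael"))))
print(depth(tree))   # 4
```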

22 Feature Extraction: Semantic Complexity
- Operationalised using various external information sources, e.g. Roget's Thesaurus
- The higher up in the thesaurus structure a term appears, the more abstract the word is
- The lower a term appears in the thesaurus structure, the more specific the concept is
- The WordNet lexical resource gives a "familiarity" score to nouns and verbs, similar to word frequency lists

23 Feature Extraction: Rhetorical Structure
- This is where NLP gets very difficult…
- Rhetorical structure analysis is still mostly done manually
- The presence of cue words in the text signals a relation
- But at most 50% of relations are signalled
- Shallow rhetorical structure analysis will be performed – deep analysis is not easily automated (yet)
- Examples of relations: evidence, background, antithesis

24 Feature Extraction: Rhetorical Structure

25 Feature Extraction: Lexical Cohesion
- How well a text fits together; a measure of the coherence of the text
- Operationalised by computing the number and density of lexical chains – repetitions, synonyms, anaphora etc.
- A lexical chain is a sequence of related words in the text, spanning short (adjacent words or sentences) or long distances (the entire text)
- For example, in "The plane could not continue to fly. There was a problem with its wing. The pilot made an emergency landing.", the chain is: plane – fly – its – wing – pilot
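Real lexical-chain builders use thesauri to link synonyms and resolve anaphora; as a toy sketch, here is a detector that only links exact word repetitions across sentences (the example text is adapted so repetition alone forms a chain):

```python
def repetition_chains(sentences):
    """Toy lexical-chain detector: link exact word repetitions across
    sentences. Real chains also use synonyms and anaphora."""
    positions = {}
    for i, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            positions.setdefault(word, []).append(i)
    # keep only words that recur in more than one sentence
    return {w: pos for w, pos in positions.items() if len(set(pos)) > 1}

text = ["The plane could not continue to fly",
        "There was a problem with the plane wing",
        "The plane pilot made an emergency landing"]
print(repetition_chains(text))   # "plane" chains through all 3 sentences
```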

26 Readability Classifier  Novel machine learning approach  Classify the topically relevant set of documents returned using traditional IR model  Re-rank the topically relevant set boosting documents with the appropriate level of readability

27 C5.0: Decision Tree Learner
- A set of documents, pre-classified by readability, is given to C5.0
- C5.0 is given the feature set for these documents
- E.g. Doc001 contains 5% prepositions, 20% adjectives, …, contains 3 lexical chains, 15 terms that represent complex ideas and 14 statements of evidence

28 C5.0: Decision Tree Learner
- The classifier examines the values given for each feature and returns a set of rules that tell us how to classify a document
- E.g.:
  IF proportion of adjectives > 15% AND number of lexical chains >= 4
  THEN: document is easily readable
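The example rule above, once induced, is just a conditional over feature values. A sketch applying it by hand (the feature names are illustrative; in practice C5.0 learns such rules automatically from the labelled training set):

```python
def classify(features):
    """Apply the slide's example rule: high adjective proportion plus
    several lexical chains -> 'easy'; everything else -> 'difficult'."""
    if features["adjective_pct"] > 15 and features["lexical_chains"] >= 4:
        return "easy"
    return "difficult"

print(classify({"adjective_pct": 20, "lexical_chains": 5}))   # easy
print(classify({"adjective_pct": 10, "lexical_chains": 5}))   # difficult
```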

29 Work in Progress  No significant existing corpus annotated with readability data  Large scale modern study to find best feature set to classify for readability that is not domain specific  How best to infer a user’s level of domain knowledge: implicit / explicit?  How best to incorporate readability into an IR environment without compromising topically relevant set

30 Initial Experiments
- Machine learning on 2394 "easy" and "difficult" documents, using POS statistics (to obtain a measure of syntactic complexity) and the traditional Flesch formula. Results per cross-validation fold:

  Fold   POS     Flesch   Combined
  0      12.0    9.8      6.9
  1      12.4    9.7      6.4
  2      14.0    9.6      6.5
  3      12.5    9.5      6.8
  4      12.5    9.6      6.9
  5      10.8    9.5      7.0
  6      11.4    9.8      6.2
  7      11.8    9.8      6.2
  8      11.8    10.0     6.9
  9      12.5    9.7      7.2
  Mean   12.2%   9.7%     6.7%
  SE     0.3%             0.1%

31 Thanks  Questions?

