What is Readability?  A characteristic of text documents..  “the sum total of all those elements within a given piece of printed material that affect.


2 What is Readability?  A characteristic of text documents.  “the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)  “ease of understanding or comprehension due to the style of writing” (Klare, 1963)

3  Readability encompasses several areas:  Syntactic complexity of the text ▪ Grammatical arrangement of words within a sentence (e.g., whether a sentence is active or passive has been shown to affect readability) ▪ Simple, compound, and complex sentences  Organization of the text ▪ Discourse structure ▪ Textual cohesion  Semantic complexity of the text

4  Improving literacy rates  Improving instruction delivery  Judging technical manuals  Matching text to the appropriate grade level  And many more…

5  Assign a score to a text based on textual cues (e.g., average sentence length)  Readability formula  Over 200 formulas by the 1980s (DuBay, 2004)  Textual cues ▪ Sentence length, percentage of familiar words, word length, syllables per word, etc.  Testing validity: correlate the predicted score with a reading comprehension score

7  Dale-Chall formula  Maintains a list of “easy words”  Score = 0.1579 × PDW + 0.0496 × ASL + 3.6365 ▪ PDW = percentage of difficult words (words not on the easy-word list) ▪ ASL = average sentence length in words  FOG index  Lexile scale  Commonalities among the formulae  Linear regression over some predictor variables
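
The Dale-Chall score on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regex-based sentence and word splitting is a simplification, and the `easy_words` set passed in is a hypothetical stand-in for the real easy-word list, which is not reproduced here.

```python
import re

def dale_chall_score(text, easy_words):
    """Score = 0.1579 * PDW + 0.0496 * ASL + 3.6365, as given on the slide.

    easy_words: a set of lowercase words treated as "easy" (hypothetical
    stand-in for the actual Dale-Chall list).
    """
    # Assumption: sentences end at '.', '!' or '?'; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    difficult = sum(1 for w in words if w not in easy_words)
    pdw = 100.0 * difficult / len(words)   # percentage of difficult words
    asl = len(words) / len(sentences)      # average sentence length in words
    return 0.1579 * pdw + 0.0496 * asl + 3.6365
```

With a text whose words are all on the easy list, only the sentence-length term and the constant contribute to the score.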

8  Traditional readability measures are robust on large text samples (textbooks and essays) but less so on short, concise web documents.  Web documents are generally noisy. Resource: Predicting Reading Difficulty With Statistical Language Models, Kevyn Collins-Thompson and Jamie Callan

9  A language model (LM) can encode more complex relationships than the simple linear regression models used in traditional readability measures  It gives a probability distribution over all grade levels  The relative difficulty of words is obtained statistically, rather than hard-coded as in traditional measures

10  Earlier-grade readers tend to use more concrete words (e.g., red); later-grade readers use more abstract words (e.g., determine)  The same pattern is observed in web documents
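
A minimal sketch of the language-model idea from the last two slides: train one unigram model per grade and pick the grade under which a new text is most likely. The function names, the toy data format, and the add-one smoothing are assumptions made for illustration; the slides themselves describe Good-Turing smoothing for the actual method.

```python
import math
from collections import Counter

def train_grade_models(docs_by_grade):
    """docs_by_grade: {grade: [token list, ...]} -> {grade: Counter of word counts}."""
    return {
        grade: Counter(tok for doc in docs for tok in doc)
        for grade, docs in docs_by_grade.items()
    }

def log_likelihood(tokens, counts, vocab_size):
    """Unigram log-likelihood with add-one smoothing (an assumption, not the slides' method)."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def predict_grade(tokens, models):
    """Return the grade whose unigram model assigns the text the highest likelihood."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot reserved for unseen words
    return max(models, key=lambda g: log_likelihood(tokens, models[g], vocab_size))
```

For example, with one grade-1 document containing "red" and one grade-8 document containing "determine", `predict_grade(["red"], models)` picks grade 1 and `predict_grade(["determine"], models)` picks grade 8.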

14 Text words

15 Token

28  Smooth each grade-based language model using Good-Turing smoothing  This gives an estimate of the total probability mass of all unseen words  We still need to find each unseen word’s share of this total probability mass  Uniform probability distribution?
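
The two steps on this slide can be sketched as follows. The first function uses the standard Good-Turing estimate of unseen mass (the number of word types seen exactly once, divided by the total number of tokens). The second implements the naive uniform split that the slide questions; `vocabulary` is a hypothetical full word list supplied by the caller.

```python
from collections import Counter

def unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen words: N1 / N."""
    if not tokens:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    n1 = sum(1 for c in counts.values() if c == 1)  # types seen exactly once
    return n1 / len(tokens)

def uniform_unseen_share(tokens, vocabulary):
    """Split the unseen mass evenly over vocabulary words absent from the sample."""
    seen = set(tokens)
    unseen = [w for w in vocabulary if w not in seen]
    if not unseen:
        return {}
    share = unseen_mass(tokens) / len(unseen)
    return {w: share for w in unseen}
```

A uniform split ignores what is known about a word from other grade levels, which is the gap the next slide addresses.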

29  Usage of discriminative words is clustered around particular grade levels.  Borrow probability mass from neighboring grade classes
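
One way to picture "borrowing" from neighboring grades is a weighted average of a word's probability across grade models, with weights that decay with grade distance. The triangular kernel and the `width` parameter below are assumptions chosen for simplicity, not necessarily the kernel used in the cited work.

```python
def smooth_across_grades(word_probs, grade, width=2):
    """Kernel-weighted average of P(word | g) over grades near `grade`.

    word_probs: {grade: P(word | grade)} for a single word.
    width: how many grades away can still contribute (hypothetical parameter).
    """
    numerator = 0.0
    denominator = 0.0
    for g, p in word_probs.items():
        # Triangular kernel: weight 1 at the target grade, falling linearly to 0.
        weight = max(0.0, 1.0 - abs(g - grade) / (width + 1))
        numerator += weight * p
        denominator += weight
    return numerator / denominator if denominator else 0.0
```

Because the weights depend only on the grades and not on the word, applying this to every word of properly normalized grade models again yields a distribution that sums to one.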

32 [Diagram: documents with assigned readability scores → training; new document → readability score] Resource: Revisiting Readability: A Unified Framework for Predicting Text Quality, Emily Pitler and Ani Nenkova

33  Different predictor variables are indicative of the readability score  What is the contribution of each individual predictor variable to the readability score?  Testing methodology: collect a readability corpus → extract predictor variables → measure correlation
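
The "measure correlation" step can be illustrated with a plain Pearson correlation between one predictor variable and the readability scores of the same documents. Pearson's coefficient is used here as an assumption about which correlation measure is meant; the input lists are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between a predictor variable and readability scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 or -1 suggests the predictor tracks the readability score closely; a value near 0 suggests it contributes little on its own.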

36 +Ve -Ve

41  Log likelihood, WSJ  Article likelihood estimated from a language model trained on WSJ  Log likelihood, NEWS  Article likelihood according to a unigram language model trained on NEWS  LL with length, WSJ  Linear regression on the WSJ unigram log likelihood and article length  LL with length, NEWS  Linear regression on the NEWS unigram log likelihood and article length
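
A rough sketch of how these features could be extracted for one article. The background corpus is represented by a hypothetical token list, and add-one smoothing is an assumption made so that unseen words do not produce a zero probability. The returned pair (log likelihood, length) is what a regression on both would take as input.

```python
import math
from collections import Counter

def unigram_model(background_tokens):
    """Build an add-one smoothed unigram model from a background corpus."""
    counts = Counter(background_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob

def article_features(tokens, prob):
    """Return the article's unigram log likelihood together with its length."""
    log_likelihood = sum(math.log(prob(t)) for t in tokens)
    return {"log_likelihood": log_likelihood, "length": len(tokens)}
```

Longer articles get lower log likelihoods simply because more probabilities are multiplied together, which is one reason to pair the likelihood with article length.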

42  Average parse tree height  Average number of noun phrases per sentence  Average number of verb phrases per sentence  Average number of subordinate clauses per sentence  Measured by counting SBAR nodes in the parse tree
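
Given a bracketed (Penn Treebank style) parse for each sentence, these syntactic features reduce to simple tree traversals. The tiny parser below is a self-contained sketch that assumes well-formed bracketing; producing the parses is left to an external parser, and the convention that a bare word has height 0 is an assumption.

```python
import re

def parse_tree(bracketed):
    """Parse a string like '(S (NP (DT The) (NN cat)) (VP (VBD sat)))'.

    Returns nested (label, children) tuples; leaf words are plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a leaf word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume the closing ')'
        return (label, children)

    return read()

def height(node):
    """Number of labeled levels from this node down to the deepest word."""
    if isinstance(node, str):
        return 0
    return 1 + max((height(child) for child in node[1]), default=0)

def count_label(node, label):
    """Count nodes with the given label, e.g. 'NP', 'VP' or 'SBAR'."""
    if isinstance(node, str):
        return 0
    own = 1 if node[0] == label else 0
    return own + sum(count_label(child, label) for child in node[1])
```

Averaging `height(tree)` and `count_label(tree, "NP")`, `"VP"`, or `"SBAR"` over all sentences of a document gives the four features listed on the slide.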

43  The curious case of average verb phrases  A higher number of verb phrases per sentence might be expected to increase text complexity ▪ If so, the average number of verb phrases should correlate negatively with readability  Consider the following examples:  It was late at night, but it was clear. The stars were out and the moon was bright. (1)  It was late at night. It was clear. The stars were out. The moon was bright. (2)

44  Aspects of well-written discourse  Cohesive devices such as pronouns, definite descriptions, and topic continuity  Number of pronouns per sentence  Number of definite articles per sentence  Average cosine similarity  Word overlap  Word overlap over nouns and pronouns
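
The similarity-based cohesion features can be sketched as below. Two assumptions are made that the slide does not spell out: the measures are computed between adjacent sentences and then averaged, and word overlap is taken to be the Jaccard ratio of the two sentences' word sets. Sentences are given as token lists.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between the word-count vectors of two sentences."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def word_overlap(a_tokens, b_tokens):
    """Shared word types divided by all word types (Jaccard; an assumed definition)."""
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_adjacent(sentences, measure):
    """Average a pairwise measure over all pairs of adjacent sentences."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(measure(x, y) for x, y in pairs) / len(pairs)
```

Restricting the token lists to nouns and pronouns before calling `word_overlap` would give the last feature on the slide, assuming part-of-speech tags are available.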

45  Entity-based approach to local coherence  Discourse coherence is achieved through the way discourse entities are introduced and discussed  Some entities are more salient than others ▪ Salient entities are more likely to appear in prominent syntactic positions (such as subject or object), and to be introduced in a main clause. ▪ Centering theory models the continuity of discourse

46  Entity-Grid discourse representation  Each text is represented by an entity grid ▪ A two-dimensional array that captures the distribution of entities across text sentences. Optional Resource: Modeling Local Coherence: An Entity-Based Approach, Regina Barzilay and Mirella Lapata

48 If a noun phrase appears more than once in a sentence, it is assigned the highest-ranked grammatical role among its occurrences [S > O > X]. Example, sentence 1: ‘Microsoft’ appears both as subject (S) and in the rest (X) category, so the entry for Microsoft is marked S.
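
The grid construction and the S > O > X rule can be sketched as follows. The input format, a list of (entity, role) mentions per sentence, is a hypothetical simplification that assumes entity mentions and their grammatical roles have already been extracted; "-" marks a sentence in which the entity does not appear.

```python
# Higher rank wins when an entity is mentioned several times in one sentence.
RANK = {"S": 3, "O": 2, "X": 1, "-": 0}

def entity_grid(sentences):
    """Build an entity grid.

    sentences: list of sentences, each a list of (entity, role) mentions,
               with role in {"S", "O", "X"}.
    Returns {entity: [role in sentence 1, role in sentence 2, ...]}.
    """
    entities = []
    for mentions in sentences:
        for entity, _role in mentions:
            if entity not in entities:
                entities.append(entity)  # keep first-mention order

    grid = {entity: ["-"] * len(sentences) for entity in entities}
    for i, mentions in enumerate(sentences):
        for entity, role in mentions:
            if RANK[role] > RANK[grid[entity][i]]:
                grid[entity][i] = role
    return grid
```

For the slide's example, a first sentence with mentions `("Microsoft", "S")` and `("Microsoft", "X")` yields an S entry for Microsoft in that sentence.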

55  An increase in the number of discourse relations in a document lowers its log-likelihood  Number of relations in a document as a feature

57  200+ readability measures and still counting  Are they really looking at deeper aspects of language comprehension?  Are they tuned towards individual reading abilities?  Is reader in the loop?

58  How do we comprehend sentences?  How do we store and access words?  How do we resolve ambiguities?

