Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.

Domain-Specific Resources (WING, NUS)
– Example: Wikipedia page on modular arithmetic
– Example: Interactivate page on clocks and modular arithmetic
Domain-specific resources cater to a wide range of audiences.

Challenge for a Domain-Specific Search Engine
How to measure readability for domain-specific resources?

Literature Review
Heuristic Readability Measures
– Weighted sum of textual feature values
– Examples:
    • Flesch Kincaid Reading Ease [formula not preserved]
    • Dale-Chall [formula not preserved]
– Quick and indicative but oversimplifying
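As an illustration of a "weighted sum of textual feature values", here is a minimal sketch of the standard Flesch Reading Ease formula, which the slide's FKRE presumably refers to. The sentence splitter and the vowel-group syllable counter are rough assumptions made for this sketch, not part of the talk; a real implementation would use a pronunciation dictionary or a dedicated library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel groups (an assumption, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    # Naive sentence and word segmentation (assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

The score depends only on sentence length and word length, which is why such measures are quick but blind to how difficult a domain-specific concept actually is.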

Literature Review
Natural Language Processing and Machine Learning Approaches
– Extract deep text features and construct sophisticated models for prediction
– Text features
    • N-grams, height of parse tree, discourse relations
– Models
    • Language model, Naïve Bayes, Support Vector Machine
– More accurate, but require an annotated corpus and ignore domain-specific concepts

Literature Review
Domain-Specific Readability Measures
– Derive information about domain-specific concepts from expert knowledge sources
– Examples:
    • Wordlist
    • Ontology
– Also improve performance, but knowledge sources are still expensive and not always available
Is it possible to measure readability for domain-specific resources without an expensive corpus or knowledge source?

Intuitions
A domain-specific resource is less readable than another if the former contains more difficult concepts.
A domain-specific concept is more difficult than another if the former appears in less readable resources.
Use an iterative computation algorithm to estimate these two scores from each other.
Example:
– Pythagorean theorem vs. ring theory

Algorithm
Required Input
– A collection of domain-specific resources (w/o annotation)
– A list of domain-specific concepts
Graph Construction
– Construct a graph representing resources, concepts and occurrence information
Score Computation
– Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts

Graph Construction
Preprocessing
– Extraction of occurrence information
Construction steps
– Resource node creation
– Concept node creation
– Edge creation based on occurrence information
[Figure: two example resources (Resource 1 mentioning Pythagorean Theorem, triangle, sine and tangent; Resource 2 mentioning trigonometry, sine, tangent and triangle), a concept list (Pythagorean Theorem, tangent, triangle, trigonometry, sine), and the resulting graph linking each resource node to the concept nodes it mentions.]
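The construction steps above can be sketched as a small bipartite graph builder. This is only an illustration: the function name `build_graph`, the dictionary-of-sets representation, and the case-insensitive substring matching used to extract occurrences are all assumptions, since the slides do not say how occurrence information is extracted.

```python
def build_graph(resources: dict[str, str], concepts: list[str]):
    """Link each resource to the concepts that occur in its text.

    Returns two adjacency maps: resource -> concepts and concept -> resources.
    """
    # Resource node creation and concept node creation.
    resource_to_concepts = {name: set() for name in resources}
    concept_to_resources = {concept: set() for concept in concepts}

    # Edge creation based on occurrence information.
    for name, text in resources.items():
        lowered = text.lower()
        for concept in concepts:
            # Naive substring match (assumption); it would also match
            # "sine" inside "business", so real text needs tokenization.
            if concept.lower() in lowered:
                resource_to_concepts[name].add(concept)
                concept_to_resources[concept].add(name)
    return resource_to_concepts, concept_to_resources
```

With the two example resources from the slide, "sine", "tangent" and "triangle" end up connected to both resource nodes, while "Pythagorean Theorem" and "trigonometry" each connect to only one.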

Score Computation
Initialization
– Resource node: FKRE
– Concept node: average score of neighboring nodes
Iterative Computation
– All nodes: current score + average score of neighboring nodes
Termination Condition
– The ranking of the resources stabilizes
Example (resource nodes w, x, y, z; concept nodes a, b, c):

                 w     x      y      z     a      b     c
Initialization   1     3      3      5     2      3     4
Iteration 1      3     5.5    6.5    9     4      6     8
Iteration 2      7     10.5   13.5   17    8.25   12    15.75
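The update rule on this slide can be sketched in a few lines. The edge list used in the usage note (w–a, x–a, x–b, y–b, y–c, z–c) is not stated in the transcript; it is inferred here because it is consistent with every number in the slide's worked example. All scores are updated synchronously from the previous iteration's values, which is also what the example numbers imply. Function and variable names are hypothetical.

```python
def iterate_scores(edges, resource_init, iterations):
    """Return the score table after initialization and after each iteration.

    edges: iterable of (resource, concept) pairs.
    resource_init: initial readability score per resource (FKRE in the talk).
    """
    neighbors = {}
    for resource, concept in edges:
        neighbors.setdefault(resource, set()).add(concept)
        neighbors.setdefault(concept, set()).add(resource)

    # Initialization: concepts start at the average of their neighboring resources.
    scores = dict(resource_init)
    for node, nbrs in neighbors.items():
        if node not in resource_init:
            scores[node] = sum(resource_init[n] for n in nbrs) / len(nbrs)

    history = [dict(scores)]
    for _ in range(iterations):
        updated = {}
        for node, score in scores.items():
            nbrs = neighbors.get(node, ())
            # Current score + average score of neighboring nodes (old values).
            average = sum(scores[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            updated[node] = score + average
        scores = updated
        history.append(dict(scores))
    return history


def resource_ranking(scores, resources):
    """Resources ordered by score; ties broken by name (an assumption)."""
    return sorted(resources, key=lambda r: (scores[r], r))
```

Calling `iterate_scores([("w","a"),("x","a"),("x","b"),("y","b"),("y","c"),("z","c")], {"w": 1, "x": 3, "y": 3, "z": 5}, 2)` yields the three rows of the slide's example. To apply the termination condition, one could compare `resource_ranking` on consecutive entries of the history and stop once it no longer changes.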

Evaluation
Goals
– Effectiveness
    • Iterative computation vs. other readability measures in the math domain
– Efficiency
    • Iterative computation with selection of domain-specific resources and concepts in the math domain
– Portability
    • Iterative computation vs. other readability measures in the medical domain

Effectiveness Experiment
Corpus
– Collection
    • 27 math concepts
    • 1st 100 search results from Google
– Annotation
    • 120 randomly chosen webpages
    • Annotated by first author and 30 undergraduate students using a 7-point readability scale
    • Kappa: 0.71, Spearman’s rho: 0.93

Value   Education Background
1       Primary
2       Lower Secondary
3       Higher Secondary
4       Junior College (Basic)
5       Junior College (Advanced)
6       University (Basic)
7       University (Advanced)

Effectiveness Experiment
Baseline:
– Heuristic
    • FKRE
– Supervised learning
    • Naïve Bayes, Support Vector Machine, Maximum Entropy
    • Binary word features only
Metrics:
– Pairwise accuracy
– Spearman’s rho
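The two metrics can be sketched in plain Python. The slides do not define pairwise accuracy, so the version below assumes one common definition: among all pairs of pages whose gold labels differ, the fraction that the predicted scores order the same way. Spearman's rho is computed as the Pearson correlation of average ranks, which handles tied labels on a 7-point scale. Both functions assume gold and predicted values point in the same direction (higher means harder); a score such as reading ease would need to be negated first.

```python
from itertools import combinations


def pairwise_accuracy(gold, pred):
    """Fraction of gold-distinct pairs that pred orders the same way (assumed definition)."""
    correct = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # pairs tied in the gold labels are skipped (assumption)
        total += 1
        if (gold[i] - gold[j]) * (pred[i] - pred[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

For gold labels `[1, 2, 3, 4]` and predictions `[0.1, 0.4, 0.3, 0.9]`, one of the six pairs is misordered, so pairwise accuracy is 5/6 and Spearman's rho is 0.8.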

Effectiveness Experiment
Results
– FKRE and NB show modest correlation
– SVM and Maxent perform significantly better
– Best performance is achieved by iterative computation

Method   Pairwise   Spearman
FKRE     .72        .48
NB       .72        .52
SVM      .80        .70
Maxent   .82        .67
IC       .85        .72

Efficiency Experiment
Corpus/Metrics same as before
Different selection strategies
– Resource selection at random
– Resource selection by quality
– Concept selection at random
– Concept selection by TF.IDF
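To make the last strategy concrete, here is one possible way to rank concepts by TF.IDF and keep the top k. The slides do not specify the weighting, so everything here is an assumption: term frequency is the concept's total count across the collection, inverse document frequency is log(N / df), and occurrences are counted by case-insensitive substring matching. The function names are hypothetical.

```python
import math


def tfidf_concept_scores(resources, concepts):
    """Score each concept as (total occurrences) * log(N / document frequency)."""
    n_resources = len(resources)
    scores = {}
    for concept in concepts:
        key = concept.lower()
        counts = [text.lower().count(key) for text in resources]
        doc_freq = sum(1 for c in counts if c > 0)
        # A concept that never occurs, or occurs in every resource, scores 0.
        idf = math.log(n_resources / doc_freq) if doc_freq else 0.0
        scores[concept] = sum(counts) * idf
    return scores


def select_concepts(resources, concepts, k):
    """Keep the k highest-scoring concepts (ties keep their input order)."""
    scores = tfidf_concept_scores(resources, concepts)
    return sorted(concepts, key=lambda c: scores[c], reverse=True)[:k]
```

Under this weighting, a concept that appears in every resource gets a score of zero and is dropped first, which matches the intuition that such concepts cannot help tell resources apart.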

Efficiency Experiment
Results
– If chosen at random, the more resources/concepts the better
– When chosen by quality, a small set of resources is also sufficient
– Selection by TF.IDF helps to filter out useless concepts

Portability Experiment
Corpus
– Collection
    • 27 medical concepts
    • 1st 100 search results from Google
– Annotation
    • Readability of 946 randomly chosen webpages annotated by first author on the same readability scale
Metric/Baseline same as before

Portability Experiment
Results
– Heuristic is still the weakest
– Supervised approaches benefit greatly from the larger amount of annotation
– Iterative computation remains competitive
– Limited readability spectrum in medical domain

Method   Pairwise   Spearman
FKRE     .63        .28
NB       .73        .53
SVM      .82        .70
Maxent   .76        .60
IC       .72        .49
ICS      .75        .54

Future Work
Processing
– Noise reduction
Probabilistic formulation
– Distribution of values
    • e.g. 70% of webpages highly readable and 30% much less readable
– Correlations between multiple pairs of attributes
    • e.g. Genericity and page type

Conclusion
Iterative Computation
– Readability of domain-specific resources and difficulty of domain-specific concepts can be estimated from each other
– Simple yet effective, efficient and portable
Part of the exploration in Domain-specific Information Retrieval
– Categorization
– Readability
– Text to domain-specific construct linking

Any questions?

Related Graph-based Algorithms
PageRank
– Directed links
– Backlinks indicate popularity/recommendation
HITS
– Hub and authority score for each node
SALSA
