Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.

Presentation transcript:

Jin Zhao and Min-Yen Kan, 11/06/2010 (WING, NUS)

Domain-Specific Resources
Domain-specific resources cater to a wide range of audiences.
[Figures: Wikipedia page on modular arithmetic; Interactivate page on clocks and modular arithmetic]

Challenge for a Domain-Specific Search Engine
How to measure readability for domain-specific resources?

Literature Review
Heuristic Readability Measures
– Weighted sum of textual feature values
– Examples:
  • Flesch Kincaid Reading Ease
  • Dale-Chall
– Quick and indicative but oversimplifying
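
The formulas on this slide did not survive in the transcript. As an illustration of the "weighted sum of textual feature values" idea, here is a minimal Python sketch of the standard Flesch Reading Ease formula, 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words). The tokenization and the vowel-group syllable counter are simplifying assumptions, not the authors' implementation.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel runs (an assumption, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Standard Flesch Reading Ease; higher scores mean easier text."""
    # Naive sentence and word splitting, good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Only two surface features enter the score (sentence length and word length in syllables), which is what makes such measures quick but oversimplifying for domain-specific text.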

Literature Review
Natural Language Processing and Machine Learning Approaches
– Extract deep text features and construct sophisticated models for prediction
– Text features
  • N-grams, height of parse tree, discourse relations
– Models
  • Language model, Naïve Bayes, Support Vector Machine
– More accurate, but require an annotated corpus and ignore domain-specific concepts

Literature Review
Domain-Specific Readability Measures
– Derive information about domain-specific concepts from expert knowledge sources
– Examples:
  • Wordlist
  • Ontology
– Also improve performance, but knowledge sources are still expensive and not always available
Is it possible to measure readability for domain-specific resources without an expensive corpus or knowledge source?

Intuitions
A domain-specific resource is less readable than another if the former contains more difficult concepts
A domain-specific concept is more difficult than another if the former appears in less readable resources
Use an iterative computation algorithm to estimate these two scores from each other
Example:
– Pythagorean theorem vs. ring theory

Algorithm
Required Input
– A collection of domain-specific resources (w/o annotation)
– A list of domain-specific concepts
Graph Construction
– Construct a graph representing resources, concepts and occurrence information
Score Computation
– Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts

Graph Construction
Preprocessing
– Extraction of occurrence information
Construction steps
– Resource node creation
– Concept node creation
– Edge creation based on occurrence information
[Figure: two example resources (Resource 1, Resource 2) whose text mentions Pythagorean Theorem, triangle, sine, tangent and trigonometry; a concept list (Pythagorean Theorem, tangent, triangle, trigonometry, sine); and the resulting graph linking each resource node to the concept nodes that occur in it]
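
The construction steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `build_graph`, the `R:`/`C:` node prefixes, the case-insensitive substring matching, and the two sample texts are all assumptions made for the example (the transcript does not preserve which concepts belong to which resource).

```python
def build_graph(resources, concepts):
    """Build an undirected bipartite graph as an adjacency map.

    resources: dict mapping a resource id to its text.
    concepts:  list of concept strings.
    Resource nodes are named "R:<id>", concept nodes "C:<concept>".
    """
    graph = {}
    # Concept node creation.
    for concept in concepts:
        graph.setdefault("C:" + concept, set())
    for rid, text in resources.items():
        # Resource node creation.
        rnode = "R:" + rid
        graph.setdefault(rnode, set())
        lowered = text.lower()
        for concept in concepts:
            # Edge creation from occurrence information.
            # Naive substring match: an assumption, real preprocessing would be more careful.
            if concept.lower() in lowered:
                cnode = "C:" + concept
                graph[rnode].add(cnode)
                graph[cnode].add(rnode)
    return graph


# Hypothetical sample data loosely modeled on the slide's example.
concepts = ["Pythagorean theorem", "tangent", "triangle", "trigonometry", "sine"]
resources = {
    "Resource 1": "The Pythagorean theorem relates the sides of a right triangle.",
    "Resource 2": "In trigonometry, sine and tangent are ratios of triangle sides.",
}
graph = build_graph(resources, concepts)
```

In this sample, the "triangle" concept node ends up connected to both resources, while "Pythagorean theorem" is connected only to Resource 1.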

Score Computation
Initialization
– Resource Node (FKRE)
– Concept Node (Average score of neighboring nodes)
Iterative Computation
– All nodes (Current score + average score of neighboring nodes)
Termination Condition
– The ranking of the resources stabilizes
[Figure: resource nodes w, x, y, z and concept nodes a, b, c, shown at initialization and after successive iterations]
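
A literal reading of the three steps above can be sketched in Python as below. Several details are assumptions rather than facts from the slides: updates are applied synchronously, "stabilizes" is taken to mean that the resource ranking is unchanged between two consecutive iterations, isolated nodes contribute an average of 0, and `max_iter` is a safety cap. Because the seed is FKRE, higher scores in this sketch mean easier resources and easier concepts.

```python
def neighbor_average(node, graph, scores):
    """Average score of a node's neighbors (0.0 for an isolated node, by assumption)."""
    neighbors = graph[node]
    if not neighbors:
        return 0.0
    return sum(scores[n] for n in neighbors) / len(neighbors)


def iterative_computation(graph, resource_init, max_iter=50):
    """graph: adjacency map (node -> set of neighbors), bipartite resources/concepts.
    resource_init: initial score per resource node, e.g. its FKRE value.
    Returns the final score of every node.
    """
    # Initialization: resources get their seed score,
    # concepts get the average score of their neighboring resources.
    scores = dict(resource_init)
    for node in graph:
        if node not in resource_init:
            scores[node] = neighbor_average(node, graph, resource_init)

    def ranking(current):
        # Resources ordered from highest to lowest score, ties broken by name.
        return sorted(resource_init, key=lambda r: (-current[r], r))

    previous = ranking(scores)
    for _ in range(max_iter):
        # Iterative step: current score + average score of neighboring nodes,
        # computed for all nodes from the previous round's scores.
        scores = {n: scores[n] + neighbor_average(n, graph, scores) for n in graph}
        current = ranking(scores)
        if current == previous:  # termination: resource ranking has stabilized
            break
        previous = current
    return scores
```

Raw scores grow with every iteration under this update rule; only the induced ranking is meaningful, which is why the stopping test compares rankings rather than score differences.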

Evaluation
Goals
– Effectiveness
  • Iterative computation vs. other readability measures in math domain
– Efficiency
  • Iterative computation with domain-specific resources and concepts selection in math domain
– Portability
  • Iterative computation vs. other readability measures in medical domain

Effectiveness Experiment
Corpus
– Collection
  • 27 math concepts
  • First 100 search results from Google
– Annotation
  • 120 randomly chosen webpages
  • Annotated by first author and 30 undergraduate students using a 7-point readability scale
  • Kappa: 0.71, Spearman’s rho: value missing from the transcript

Value  Education Background
1      Primary
2      Lower Secondary
3      Higher Secondary
4      Junior College (Basic)
5      Junior College (Advanced)
6      University (Basic)
7      University (Advanced)

Effectiveness Experiment
Baseline:
– Heuristic
  • FKRE
– Supervised learning
  • Naïve Bayes, Support Vector Machine, Maximum Entropy
  • Binary word features only
Metrics:
– Pairwise accuracy
– Spearman’s rho
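
For readers unfamiliar with the two metrics, here is a small pure-Python sketch. The definition of pairwise accuracy used here (share of page pairs with different gold labels that the predicted scores order the same way) is an assumption, since the slides do not spell it out. Spearman's rho is computed as the Pearson correlation of average ranks. Both functions expect gold and predicted values oriented in the same direction, so a readability score such as FKRE (higher means easier) would need to be negated before comparison with the 1 to 7 difficulty scale.

```python
from itertools import combinations


def pairwise_accuracy(gold, predicted):
    """Fraction of gold-distinct pairs whose order the prediction reproduces."""
    agree = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # tied gold pairs carry no ordering information
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            agree += 1
    return agree / total if total else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```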

Effectiveness Experiment
Results
– FKRE and NB show modest correlation
– SVM and Maxent perform significantly better
– Best performance is achieved by iterative computation
[Table: pairwise accuracy and Spearman’s rho for FKRE, NB, SVM, Maxent and IC; only the IC row survives in the transcript: pairwise .85, Spearman .72]

Efficiency Experiment
Corpus/Metrics same as before
Different selection strategies
– Resource selection at random
– Resource selection by quality
– Concept selection at random
– Concept selection by TF.IDF
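
The slides do not say which TF.IDF variant was used for concept selection. As one possible reading, the Python sketch below scores each concept by its total occurrence count in the collection multiplied by log(N / df), then keeps the top `k`. The function name, the substring counting, and the tie-breaking by name are assumptions for illustration only.

```python
import math


def select_concepts_by_tfidf(texts, concepts, k):
    """Return the k concepts with the highest collection-level TF.IDF score.

    texts:    list of resource texts.
    concepts: list of candidate concept strings.
    """
    n = len(texts)
    lowered = [t.lower() for t in texts]
    scored = []
    for concept in concepts:
        # Term frequency per resource via naive substring counting (an assumption).
        counts = [t.count(concept.lower()) for t in lowered]
        df = sum(1 for c in counts if c > 0)
        if df == 0:
            continue  # concept never occurs, so it cannot contribute edges
        score = sum(counts) * math.log(n / df)
        scored.append((score, concept))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [concept for _, concept in scored[:k]]
```

Under this scoring, a concept that appears in every resource gets a score of 0 and is ranked last, which matches the intuition that ubiquitous concepts do little to separate easy resources from hard ones.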

Efficiency Experiment
Results
– If chosen at random, the more resources/concepts the better
– When chosen by quality, a small set of resources is also sufficient
– Selection by TF.IDF helps to filter out useless concepts

Portability Experiment
Corpus
– Collection
  • 27 medical concepts
  • First 100 search results from Google
– Annotation
  • Readability of 946 randomly chosen webpages annotated by first author on the same readability scale
Metric/Baseline same as before

Portability Experiment
Results
– Heuristic is still the weakest
– Supervised approaches benefit greatly from the larger amount of annotation
– Iterative computation remains competitive
– Limited readability spectrum in medical domain
[Table: pairwise accuracy and Spearman’s rho for FKRE, NB, SVM, Maxent, IC and ICS; only the ICS row survives in the transcript: pairwise .75, Spearman .54]

Future Work
Processing
– Noise reduction
Probabilistic formulation
– Distribution of values
  • e.g. 70% of webpages highly readable and 30% much less readable
– Correlations between multiple pairs of attributes
  • e.g. Genericity and page type

Conclusion
Iterative Computation
– Readability of domain-specific resources and difficulty of domain-specific concepts can be estimated from each other
– Simple yet effective, efficient and portable
Part of the exploration in Domain-specific Information Retrieval
– Categorization
– Readability
– Text to domain-specific construct linking

Any questions?

Related Graph-based Algorithms
PageRank
– Directed links
– Backlinks indicate popularity/recommendation
HITS
– Hub and authority score for each node
SALSA