Extracting Why Text Segment from Web Based on Grammar-gram Iulia Nagy, Master student, 2010-02-27
Summary: Introduction; Related work (Rule Based Methods, Machine Learning Approach); “Bag of Function Words” method (Method outline, Adaptation of “Bag of Function Words” to English); Experiments and Evaluation; Conclusion and Remarks
Problem: tremendous growth of the Internet; information is hard to find.
Solution: create a QA system, that is, a system capable of giving an exact answer to an exact question and of detecting the answer in arbitrary corpora. Purpose: obtain viable information rapidly.
Purpose of our research: create a why-QA system with an automatically built classifier. Classifier: uses a model presented in the Japanese literature; created using machine learning; based on the Bag of Grammar approach. Purpose of this paper: adapt the Japanese method to English; test the effectiveness of the method on English.
Related work: two main trends.
Rule Based methods: preprocess text; detect patterns; create a set of rules; apply the rules to identify the why-answer in the text.
Machine Learning methods: preprocess text; identify and extract relevant features; create a classification scheme; classify.
Rule based in why-QA: Suzan Verberne’s Approach. Improve performance by re-ranking. Method: weight the score assigned to a QA-pair by QAP with a number of syntactic features.
+ : importance of syntax; effective.
- : hardly adaptable to various languages; requires deep grammar knowledge; labour intensive.
Machine Learning method: Higashinaka and Isozaki’s Approach. Acquire causal expressions from the Japanese EDR dictionary. Method: train a ranker based on clause structures extracted from EDR.
+ : partially automated; effective.
- : hardly adaptable to various languages; not fully automated (based on EDR); EDR is rather high priced.
Machine Learning method: Tanaka’s Approach. Build a why-classifier with function words as features. Method: Bag of function words.
+ : adaptable to different languages; domain independent; scalable; effective; fully automated.
Bag of function words: a machine learning approach to automatically build a domain-independent why-classifier based on function words.
Conditions to obtain domain independence: convergence and reasonable size of the feature space; generality of the features in the feature space; ability of the features to discriminate causality.
Class fulfilling the conditions: function words.
Bag of function words: Method (same baseline for Japanese and English).
Input: text segments Ts 1, Ts 2, …, Ts n.
Create feature space: label all words with a POS tagger; determine the POS for function words; extract function words (for, because, at, after, in, under, which, that, why, to, therefore, …).
Create feature vectors: mapping using “tf-idf” on function words, giving Fv 1, Fv 2, …, Fv n. Vectors' format: $(x_i, y_i)$, $y_i \in \{\mathit{true}, \mathit{false}\}$.
Classify: classification scheme built by the trainer, LogitBoost with weak learners.
Adaptation to English: differences between Japanese and English.
Japanese: forms phrases by adding new words at the end of the phrase; uses particles to define syntactic roles in a phrase.
English: forms phrases by adding new words at the beginning of the phrase; words do not belong to a single grammatical category.
Adjustments: identify the eligible function words in English.
Experiment. Dataset: 432 text segments (216 why-answers, 216 definitions). Data processing: label all words with POS and extract function words; calculate tf-idf for each function word; map features from the feature set into feature vectors.
Experiment. Classifier: used LogitBoost (Weka) with decision stumps; created 5 classifiers (50, 100, 150, 200, 250 iterations). Evaluation: 10-fold cross-validation (models trained on 9 folds and tested on 1); measured precision, recall and F-measure.
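The evaluation protocol above can be sketched in plain Python. This is an illustration under assumptions, not the Weka procedure used in the experiment: the split here is a simple shuffled, unstratified one, predictions are pooled over all folds before scoring, and the F-measure is taken to be the balanced F1. The function names and the always-true classifier in the usage lines are hypothetical.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices 0..n_items-1 and split them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def precision_recall_f(gold, predicted, positive=True):
    """Precision, recall and balanced F-measure for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

def cross_validate(features, labels, train, predict, k=10):
    """Train on k-1 folds, predict the held-out fold, score pooled predictions."""
    predicted = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in held_out:
            predicted[i] = predict(model, features[i])
    return precision_recall_f(labels, predicted)

# Usage with a placeholder classifier that always answers "why segment".
# The dummy features only mirror the 216/216 class balance of the dataset.
features = [[0.0]] * 432
labels = [True] * 216 + [False] * 216
scores = cross_validate(features, labels,
                        train=lambda xs, ys: None,
                        predict=lambda model, x: True)
```

Calling `precision_recall_f` with `positive=False` gives the corresponding scores for the non-why class.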
Results – why text segments (WTS) [chart; x-axis: No. of iterations]
Results – non why text segments (NWTS) [chart; x-axis: No. of iterations]
Conclusion. Results: 321 instances out of 432 correctly classified; 76.1% precision and 70.6% recall on WTS; 72.6% precision and 77.9% recall on NWTS [chart; x-axis: type of TS]. Method effective on English.
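As a cross-check, the reported figures can be combined into an overall accuracy and a per-class F-measure. The values below are derived from the numbers on the slide, not reported on it, and assume the F-measure is the balanced $F_1$ (the harmonic mean of precision and recall).

```latex
\begin{align*}
\text{Accuracy} &= \frac{321}{432} \approx 0.743 \\
F_{1}^{\text{WTS}} &= \frac{2 \cdot 0.761 \cdot 0.706}{0.761 + 0.706} \approx 0.732 \\
F_{1}^{\text{NWTS}} &= \frac{2 \cdot 0.726 \cdot 0.779}{0.726 + 0.779} \approx 0.752
\end{align*}
```

The two recall values are also consistent with the total: $0.706 \cdot 216 + 0.779 \cdot 216 \approx 321$ correctly classified segments.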
Future work. Experiment with an increased dataset (> 5000 text segments), using the Yahoo!Answers database to extract the dataset. Interest: to identify the optimal number of iterations; to make a better selection of the function words to be used. Include causative constructions in the analysis: English often expresses cause by a closed set of verbs or nouns; this should increase the accuracy of the classifier.
Thank you for your attention! Questions and remarks