© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.

Generic Sentence Classification: Examining the Scenario of Scientific Abstracts and Scrum Protocols
Authors: Sebastian Schmidt (presenting), Steffen Schnitzer, Christoph Rensing
Prof. Dr.-Ing. Ralf Steinmetz, KOM - Multimedia Communications Lab
14-September-2014
File: iKNOW_SentenceClassification__SebS___2014.09.18.pptx
Image source: www.moebellisten.de

Outline
- Introduction
  - Motivation
  - Challenge and concept
- Scenarios
  - Overview
  - Corpora
- Approach used for classification
- Evaluation
  - Setup
  - Results for the scenarios
- Conclusion and Future Work

Motivation
- Information overload through a flood of textual documents
  - Professional settings
  - Research settings
  - Educational settings
- Hard for individuals to find relevant textual documents according to their information need
- String-based filtering can help to reduce the number of documents to be read
  - “Find online tutorials that deal with Java”
  - “I am searching for a job in the pharmaceutical sector”

Challenge & Concept
- Contextual ambiguity
  - “Cleaning staff wanted! We are a company in the pharmaceutic sector.” vs. “We are acquiring people having pharmaceutic training”
  - “For taking this course you should know about Java programming.” vs. “After this course you will be an expert in Java programming.”
- Pre-filtering of text sections can help!
  - Based on the type of information contained
- Goal: A generic concept for sentence-type classification
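The pre-filtering idea can be illustrated with a minimal sketch. It assumes that each sentence already carries a type label; the labels `prerequisite` and `outcome`, the hand-assigned annotations, and the helper `filter_by_type` are hypothetical and only serve the illustration. In the talk, the labels are meant to come from a classifier.

```python
# (sentence, type) pairs; the types are hand-assigned here for illustration only.
LABELLED = [
    ("For taking this course you should know about Java programming.", "prerequisite"),
    ("After this course you will be an expert in Java programming.", "outcome"),
]


def filter_by_type(sentences, keyword, wanted_type):
    """Keep sentences that contain the keyword AND carry the wanted type."""
    keyword = keyword.lower()
    return [
        text
        for text, label in sentences
        if label == wanted_type and keyword in text.lower()
    ]


# Plain string matching cannot tell the two sentences apart ...
string_only = [text for text, _ in LABELLED if "java" in text.lower()]

# ... whereas filtering on the sentence type first keeps only the relevant one.
typed = filter_by_type(LABELLED, "Java", "outcome")
```

Here `string_only` contains both sentences, while `typed` contains only the sentence describing what the course teaches.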

Scenarios: Abstracts of Scientific Articles
- An abstract presents the content of an article in condensed form
- Typical queries from researchers
  - Which other articles face a particular problem?
  - Which other articles use a particular approach?
  - Which approach performs best for a specific problem?
- Types can be assigned to the sentences, e.g.
  - Motivation
  - Goals
  - Related Work
→ Knowing this type simplifies the execution of the queries

Scenarios: Protocols of Scrum Retrospective Meetings
- Common questions (with variations)
  - What went well?
  - What went wrong?
  - What could be improved?
- Often informal content
  - “Testing took too long”
  - “Teamwork was excellent”
  - ...
- Management might be interested in particular ones only
- Automated assignment to questions could simplify the creation of the protocols
Image source: commons.wikimedia.org

Corpora: Abstracts of Scientific Articles (Multimedia)
Image source: http://digitalsherpa.com/how-to-use-social-media-to-conduct-market-research/

Corpora: Abstracts of Scientific Articles ([1])
- 1000 abstracts
  - 8,633 sentences
- Biomedical domain
- 7 classes
  - Background
  - Objective
  - Result
  - ...
- Sentences annotated with one label by three annotators
  - High inter-annotator agreement (κ = 0.85)
  → Annotations of only one annotator were used
→ Corpus BioM
Image source: http://www.dmu.ac.uk/research/research-faculties-and-institutes/health-and-life-sciences/biomedical-and-environmental-health/biomedical-and-environmental-health.aspx
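The slide does not state which κ statistic was computed for the three annotators. As a rough illustration of what such an agreement value measures, the following sketch computes Cohen's κ for a single pair of annotators, κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy labels are assumptions and are not taken from the BioM corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy annotations (made up): the annotators disagree on one of six sentences.
annotator_1 = ["Background", "Background", "Objective", "Result", "Result", "Objective"]
annotator_2 = ["Background", "Background", "Objective", "Result", "Objective", "Objective"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.75 for this toy data
```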

Corpora: Protocols of Scrum Retrospective Meetings
- 139 Scrum retrospective protocols from a major software company
  - 653 sentences
- Sentences were clustered into
  - “What went well?”
  - “What went wrong?”
  - “What could be improved?”
→ Corpus Scrum
- All sentences that could not be assigned to a cluster by humans were removed, e.g.
  - “Timing”
  - “Collaboration with Peter Smith”
→ Corpus Scrum_Subset

Approach
- Supervised classification with domain-independent features
- 10 feature groups
  - Content: all words as features
  - Sentiment: positive/negative, based on a word-to-sentiment mapping
  - Negation: count of negation words
  - Tense: based on the Stanford Lexicalized Parser
  - Tense indicator: based on word endings and modal verbs
  - Adjectives: based on the Stanford Lexicalized Parser
  - Indicative indicator: count of “need”, “should”, “must”
  - Personal pronouns: based on the Stanford Lexicalized Parser
  - Position of the sentence: normalized position of the sentence within its context
  - Number of words: total number of words
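Four of the feature groups need no parser and can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the negation word list, the tokenizer, the normalization of the position to the range 0 to 1, and the function names are all hypothetical. Only the three indicative words are taken from the slide.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "none", "cannot"}  # assumed list
INDICATIVE_WORDS = {"need", "should", "must"}              # as named on the slide


def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def sentence_features(sentence, index, total):
    """Parser-free features for the sentence at `index` out of `total` sentences."""
    tokens = tokenize(sentence)
    return {
        "negation_count": sum(t in NEGATION_WORDS for t in tokens),
        "indicative_count": sum(t in INDICATIVE_WORDS for t in tokens),
        # Assumed normalization: 0.0 for the first sentence, 1.0 for the last.
        "position": index / (total - 1) if total > 1 else 0.0,
        "num_words": len(tokens),
    }


# The third sentence is made up for this example.
protocol = [
    "Testing took too long",
    "Teamwork was excellent",
    "We should not skip code reviews",
]
features = [sentence_features(s, i, len(protocol)) for i, s in enumerate(protocol)]
```

Contractions such as "don't" are not counted as negations by this simple word list; a fuller list or a parser would be needed for that.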

Evaluation: Setup
- Different classifiers used
  - Support Vector Machines
  - Naïve Bayes
  - J48
- Weka
- 10-fold cross validation
Image source: http://www.cs.waikato.ac.nz/ml/weka/, http://scriptslines.com/blog/k-fold-cross-validation/
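The evaluation itself was run in Weka. The following stand-alone sketch only illustrates how 10-fold cross validation partitions a corpus: every sentence is used for testing exactly once and for training in the other nine rounds. The function name, the fixed seed, and the use of plain (non-stratified) folds are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with the size of the Scrum corpus (653 sentences).
splits = list(k_fold_indices(653, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

With 653 sentences, seven folds hold 65 test sentences and three hold 66. A classifier would be trained on `train` and scored on `test` in each round, and the scores averaged over the ten rounds.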

Evaluation: Abstracts of Scientific Articles (F1-Measure)

                                     MM                     BioM
                             SVM    NB     J48      SVM    NB     J48
All features                 0.692  0.690  0.640    0.798  0.731  0.739
Single feature
  Words                      0.634  0.668  0.575    0.748  0.683  0.668
  Position                   0.489  0.487  0.492    0.557  0.540  0.554
  Tense Indicator            0.278  0.279  0.265  0.254  0.319  (one of the six values is missing; column assignment uncertain)
All except single feature
  Words                      0.555  0.492  0.510    0.666  0.605  0.648
  Position                   0.634  0.656  0.576    0.750  0.670  0.675
  Adjectives                 0.699  0.692  0.641    0.799  0.735  0.738

- Best results for SVM
- Words alone give acceptable results
- Results can be better when not using all features
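For readers unfamiliar with the measure, the sketch below computes per-class F1 and a support-weighted average over classes. The slides do not say which averaging was used for the reported values, so the weighted average is an assumption, as are the function names and the toy labels.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for every class."""
    scores = {}
    for c in set(gold) | set(predicted):
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(gold, predicted):
    """Average the per-class F1 values, weighted by class frequency in gold."""
    scores = per_class_f1(gold, predicted)
    support = Counter(gold)
    return sum(scores[c] * support[c] for c in support) / len(gold)


# Toy example (made up): one of four sentences is misclassified.
gold = ["Background", "Background", "Objective", "Result"]
pred = ["Background", "Objective", "Objective", "Result"]
score = weighted_f1(gold, pred)  # 0.75 for this toy data
```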

Evaluation: Abstracts of Scientific Articles
- Different tag sets for the same kind of corpus seem to have only a minor influence on the results
→ Size of evaluation data is more relevant

Evaluation: Protocols of Scrum Retrospective Meetings (F1-Measure)

                                    Scrum                Scrum_Subset
                             SVM    NB     J48      SVM    NB     J48
All features                 0.572  0.562  0.513    0.661  0.669  0.592
Single feature
  Words                      0.552  0.533  0.485    0.647  0.644  0.546
  Sentiment                  0.323  0.379  0.425    0.415  0.464  0.458
  Tense Indicator            0.357  0.339  0.410    0.366  -      0.315
All except single feature
  Words                      0.467  0.484  0.466    0.550  0.570  0.548
  Sentiment                  0.558  0.550  0.495    0.656  0.650  0.565
  Adjectives                 0.572  0.560  0.520    0.664  0.685  0.606

- Best results for SVM/NB
- In the subset Sentiment is meaningful
- Results can be better when not using all features
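The last observation can be checked directly against the table by subtracting the all-features score from each "all except" score. The sketch below does this for two settings; the numbers are copied from the table, while the data layout and the function name are assumptions.

```python
# F1 values copied from the table: (corpus, classifier) -> score.
ALL_FEATURES = {("Scrum", "SVM"): 0.572, ("Scrum_Subset", "NB"): 0.669}

# Scores obtained when one feature group is left out.
WITHOUT_GROUP = {
    "Words":      {("Scrum", "SVM"): 0.467, ("Scrum_Subset", "NB"): 0.570},
    "Sentiment":  {("Scrum", "SVM"): 0.558, ("Scrum_Subset", "NB"): 0.650},
    "Adjectives": {("Scrum", "SVM"): 0.572, ("Scrum_Subset", "NB"): 0.685},
}


def ablation_deltas(setting):
    """Change in F1 when a feature group is removed (negative = group helps)."""
    base = ALL_FEATURES[setting]
    return {
        group: round(scores[setting] - base, 3)
        for group, scores in WITHOUT_GROUP.items()
    }


scrum_svm = ablation_deltas(("Scrum", "SVM"))
subset_nb = ablation_deltas(("Scrum_Subset", "NB"))
```

For Naïve Bayes on Scrum_Subset, removing the Adjectives group raises F1 by 0.016, while removing Words lowers it by 0.099. For SVM on Scrum, removing Adjectives leaves the score unchanged.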

Conclusion & Future Work
- Results generally good
  - Even though the training corpora are not very large
  - No domain-specific features required
- Worse results for Scrum scenarios
  - Incorrect grammar
  - Many typos
  - Shorter sentences
- Adding contextual information might be helpful
- Implementation in an application is needed to evaluate the usefulness of the filtering concept

Questions & Contact
Image source: http://www.dreifragezeichen.de/

References
[1] Y. Guo, A. Korhonen, M. Liakata, I. S. Karolinska, L. Sun, and U. Stenius. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, BioNLP ’10, pages 99–107, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

Backup Slides: Results Scientific Abstracts

Backup Slides: Results Scrum
