School of Computing, FACULTY OF ENGINEERING
An open discussion and exchange of ideas
Introduced by Eric Atwell, Language Research Group
Natural Language Processing (NLP) + Visualization and Virtual Reality (VVR)
Saman Hina (NLP seminar coordinator):
… Eric will present aspects of NLP research projects which involve "visualisation" of text, to seek advice on further visualisation techniques NLP researchers should consider; other NLPers can also ask about visualisation techniques they could use. The VVR "angle" may be that current visualisation methods work mainly for numerical datasets, so the VVR people might benefit from ideas on text analytics techniques which "turn text into numbers": what sorts of number-vectors can represent the meanings of texts, and how to extract them.
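One simple way to "turn text into numbers" is a bag-of-words count vector: one slot per vocabulary word, holding how often that word occurs in the text. The sketch below is only an illustration of that idea, not a method taken from the talk; the two example sentences and the function names (`tokenize`, `build_vocabulary`, `to_vector`) are invented for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts):
    """Return a sorted list of every distinct token seen in the texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})

def to_vector(text, vocabulary):
    """Map a text to a list of token counts, one slot per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Invented example texts (not from any corpus mentioned in the talk).
texts = ["the baby had a fever", "the baby had no fever and no cough"]
vocab = build_vocabulary(texts)
vectors = [to_vector(t, vocab) for t in texts]

print(vocab)       # ['a', 'and', 'baby', 'cough', 'fever', 'had', 'no', 'the']
for vec in vectors:
    print(vec)     # [1, 0, 1, 0, 1, 1, 0, 1] then [0, 1, 1, 1, 1, 1, 2, 1]
```

Once every text is a fixed-length vector like this, visualisation methods designed for numerical datasets can in principle be applied to it, although raw counts ignore word order and capture meaning only crudely.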
Typical NLP research
NLP research often involves developing an algorithm to automatically process some text and output an analysis, e.g.
- For each word, its Part of Speech (or semantic class, or…)
- For each sentence, its grammatical structure (parse-tree)
- For each text, its classification: genre, sentiment, Cause of Death (CoD), interesting wrt a specific task/users
Often this is done by Machine Learning: given a training dataset of example words/sentences/texts, each marked (beforehand) with its Class … learn a Classifier which can predict the Class of any new, unseen word/sentence/text.
The algorithm is automatic, so where does Visualisation fit?
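The train-then-predict pattern described above can be sketched with a tiny Naïve Bayes text classifier in plain Python. This is a minimal illustration under stated assumptions, not the classifier used in any of the projects: the four training sentences, the labels "pos"/"neg", and the function names `train` and `predict` are all invented here, and add-one smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per class from (text, label) pairs marked beforehand."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability for an unseen text."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)       # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:                                   # skip unknown words
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: each text is marked with its Class.
examples = [
    ("great film loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible film hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
model = train(examples)
print(predict("great acting", model))       # pos
print(predict("awful boring film", model))  # neg
```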
Visualisation of feature space?
Machine Learning is automatic (e.g. using the WEKA toolkit); the classification is not done by humans …
BUT ML relies on mapping each word/sentence/text into a set of FEATURES which characterise the data.
Visualisation may guide the researcher in exploring the dataset, to choose useful features?
OR: ML with different parameter-settings can produce different classification models; Visualisation may help the researcher to compare the models?
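One possible starting point for comparing models is a confusion matrix per model: a small table of counts that can be read directly or drawn as a heat map. The sketch below assumes two hypothetical models ("model A" and "model B") evaluated on the same five invented gold labels; it is not output from WEKA or from any experiment in the talk.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each (gold label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))

def render(matrix, labels):
    """Format the counts as plain text: rows are gold labels, columns are predictions."""
    header = "gold/pred".ljust(10) + "".join(lab.rjust(6) for lab in labels)
    rows = [
        g.ljust(10) + "".join(str(matrix[(g, p)]).rjust(6) for p in labels)
        for g in labels
    ]
    return "\n".join([header] + rows)

# Invented gold labels and predictions from two hypothetical models.
gold    = ["pos", "pos", "neg", "neg", "neg"]
model_a = ["pos", "neg", "neg", "neg", "pos"]
model_b = ["pos", "pos", "neg", "pos", "pos"]
labels = ["pos", "neg"]

for name, preds in [("model A", model_a), ("model B", model_b)]:
    print(name)
    print(render(confusion_matrix(gold, preds), labels))
```

Placing the two tables side by side shows where the models disagree: in this invented data, model B never misses a "pos" but mislabels more "neg" items than model A.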
Typical NLP dataset: a CORPUS (plural: Corpora or Corpuses)
- Quran – English translation; interesting subset of verses
- Leeds Arabic NLP http://www.comp.leeds.ac.uk/arabic/ Arabic morphological analysis tools
- Quranic Arabic Corpus http://corpus.quran.com/
- Verbal Autopsy interviews: narrative text + yes/no, numbers
- SNOMED-CT: Systematized Nomenclature of Medicine Clinical Terms, adopted by UK NHS and US health authorities
Verbal Autopsy Dataset
Verbal Autopsy: interview of a mother after the death of her baby.
Data collected as part of a main trial over a 7-year period; 10,000 interview reports.
Data collected includes:
- Signs and symptoms that led to the death
- History of any ailments
- Socio-economic characteristics
- Care seeking and treatment
- Fertility and obstetric history
Classification of Cause of Death by doctors at LSHTM (London School of Hygiene and Tropical Medicine, University of London), based on signs, symptoms and expert knowledge.
Problems with VA data
- Both quantitative and qualitative
- Missing values (-)
- 215 variables (plus narrative text)
- Entries can have opaque codes: sex = 1, 2, 8 or 9; Weight = 1.45, 9.99 or 8.88
- Continuous revision of the questionnaire created blank values for some variables
- Visualization of the decision tree is problematic (size = 1043, leaves = 601); also other classifier outputs, e.g. Naïve Bayes
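Before any classification or plotting, missing markers, blanks and opaque codes like those listed above usually need to be normalised. The sketch below shows one way to do that, with a loud caveat: the slides do not say what the codes 8, 9, 9.99 or 8.88 mean, so treating them as "unknown" is purely an assumption for illustration, and the variable names `fever` and `cough` and the functions `clean_value` and `clean_record` are hypothetical.

```python
# ASSUMPTION: the source does not define these codes. Treating 8/9 (sex) and
# 9.99/8.88 (weight) as "unknown" sentinels is a guess made for illustration;
# the real codebook for the questionnaire would have to confirm it.
MISSING_CODES = {
    "sex": {"8", "9", "-"},
    "weight": {"9.99", "8.88", "-"},
}
DEFAULT_MISSING = {"-"}  # the "-" missing-value marker mentioned on the slide

def clean_value(variable, raw):
    """Return None for a blank, '-' or assumed sentinel code, else the stripped string."""
    value = raw.strip()
    if value == "" or value in MISSING_CODES.get(variable, DEFAULT_MISSING):
        return None
    return value

def clean_record(record):
    """Apply clean_value to every variable of one interview record."""
    return {name: clean_value(name, raw) for name, raw in record.items()}

# Invented record; 'fever' and 'cough' are hypothetical variable names.
record = {"sex": "9", "weight": "1.45", "fever": "-", "cough": ""}
cleaned = clean_record(record)
print(cleaned)  # {'sex': None, 'weight': '1.45', 'fever': None, 'cough': None}
```

Mapping every kind of "no information" to a single value makes it possible to count and plot missingness per variable, which is itself a useful first view of a dataset with 215 variables.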
Visualising Corpus Linguistics
Paul Rayson presented an overview of techniques at the CL2009 International Conference on Corpus Linguistics:
Paul Rayson and John Mariani, 2009. Visualising Corpus Linguistics.
I like the Key Word Clouds from CL2001 … CL2009 !!!
… Wordle etc. make pretty pictures, for PR etc.; BUT do word clouds actually help guide NLP research???
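A key word cloud differs from a plain word cloud in that words are sized by "keyness" (how much more frequent a word is in the study corpus than in a reference corpus) rather than by raw frequency. The slides do not say which statistic the CL2001 … CL2009 clouds used; the sketch below assumes the log-likelihood statistic that is widely used for keyness in corpus linguistics, and the token lists and function names are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in the study corpus
    e2 = d * (a + b) / (c + d)   # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def key_words(study_tokens, reference_tokens, top_n=5):
    """Return the top_n words over-represented in the study corpus, by keyness."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:        # keep only words relatively more frequent in the study corpus
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]

# Invented token lists standing in for a study corpus and a reference corpus.
study = "fever fever fever cough the the".split()
reference = "the the the a a cough".split()
print(key_words(study, reference))  # ['fever']
```

The keyness scores could then be mapped to font sizes; whether the resulting picture guides research better than the ranked list itself is exactly the question the slide raises.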
Open to discussion
Over to you:
- NLPers can ask about visualisation techniques they could use
- VVRers can ask about ideas on text analytics techniques which might turn text into numbers
- And/or any other ideas? …
THANK YOU for your participation