Supporting the Research Process The NaCTeM Text Mining Service William Black Informatics, Manchester
Contents What is Text Mining/What is NaCTeM? Approaches/Methods Text Mining Tasks –IE, Argumentative Zoning, Terminology Discovery End-user services for researchers NaCTeM activities with social scientists
What is Text Mining? Knowledge discovery from textual sources –Primary sources Documents, News, Web –Scientific Literatures Using NLP, Ontologies, IR on a large scale
What is the Text Mining Centre? http://www.nactem.ac.uk
Established in 2004 in response to a JISC/EPSRC/BBSRC initiative
A Manchester and Liverpool collaboration
–Formerly also UMIST, Salford
–Accommodated in the Manchester Interdisciplinary Biocentre (MIB)
Aims to develop a variety of national services, with deployment from Autumn 2006
Initially applied to the biological sciences, with a second focus on social science during 2006-7
Text Mining - Approaches Distinguished from IR by semantic analysis leading to extraction of entities, facts, events, not mere documents. Distinguished from the Semantic Web by use of automated analysis based on robust natural language processing. A wide variety of methods and analyses ranging from domain-independent to domain-specific.
Methods of Text Mining Pipelined processes performing increasing levels of analysis common to all approaches –Document structure analysis, tokenization, tagging, phrasal chunking, named entity recognition/classification, fact and event extraction. –Indexed to provide conceptual IR services
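The pipeline idea on this slide can be illustrated with a minimal sketch in which each stage consumes the output of the previous one. Everything here is a toy assumption for illustration: the lexicon, the gazetteer, the function names, and the "entity verb entity" fact pattern are hypothetical and are not NaCTeM components. A real system would use trained taggers and recognizers at each stage.

```python
import re

# Hypothetical toy resources; real pipelines use trained models instead.
POS_LEXICON = {"the": "DET", "a": "DET", "inhibits": "VERB", "binds": "VERB"}
GAZETTEER = {"p53": "PROTEIN", "mdm2": "PROTEIN"}


def tokenize(text):
    """Split text into alphanumeric tokens and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def tag(tokens):
    """Assign a coarse part of speech; unknown words default to NOUN."""
    tagged = []
    for tok in tokens:
        if not tok[0].isalnum():
            tagged.append((tok, "PUNCT"))
        else:
            tagged.append((tok, POS_LEXICON.get(tok.lower(), "NOUN")))
    return tagged


def chunk(tagged):
    """Group maximal runs of NOUN tokens into phrasal chunks."""
    chunks, current = [], []
    for tok, pos in tagged:
        if pos == "NOUN":
            current.append(tok)
        elif current:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def recognize_entities(chunks):
    """Classify chunk tokens that appear in the gazetteer."""
    entities = []
    for chunk_tokens in chunks:
        for tok in chunk_tokens:
            label = GAZETTEER.get(tok.lower())
            if label:
                entities.append((tok, label))
    return entities


def extract_facts(tagged, entities):
    """Extract (entity, verb, entity) triples from adjacent tokens."""
    names = {tok for tok, _ in entities}
    facts = []
    for i in range(1, len(tagged) - 1):
        tok, pos = tagged[i]
        left, right = tagged[i - 1][0], tagged[i + 1][0]
        if pos == "VERB" and left in names and right in names:
            facts.append((left, tok, right))
    return facts


def run_pipeline(text):
    """Run every stage in order and return all intermediate results."""
    tokens = tokenize(text)
    tagged = tag(tokens)
    chunks = chunk(tagged)
    entities = recognize_entities(chunks)
    facts = extract_facts(tagged, entities)
    return {"tokens": tokens, "tagged": tagged, "chunks": chunks,
            "entities": entities, "facts": facts}


if __name__ == "__main__":
    print(run_pipeline("MDM2 inhibits p53.")["facts"])
```

Keeping every intermediate result in the returned dictionary mirrors the point that the analysed output can later be indexed for conceptual IR rather than discarded.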
Sample text mining sub-tasks Named entity recognition and classification. Terminology discovery and ontology maintenance Information extraction (IE) in limited domains - for intelligence analysts and scientists Summarization - informative, tailored, multilingual, multi-document Open-domain IE and QA Association mining over databases of extracted facts.
Illustrations of IE on successive full-page screenshots Named entity phrase bracketing Named entity extraction Fact extraction and slot filling An application to a research literature
Terminology Discovery - Ananiadou, NaCTeM A form of unsupervised learning, whose only required resource is a general purpose PoS tagger. Can be applied to text in any language, domain or genre to reveal terminology on the basis of phrasehood and distribution. TerMine will be among the first deployed NaCTeM tools.
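A rough sketch of the idea, assuming the text has already been PoS-tagged: candidate terms are taken to be multi-word runs of adjectives and nouns that end in a noun (a stand-in for "phrasehood"), and raw frequency stands in for "distribution". The tag names, function names, and thresholds are hypothetical, and this simplified ranking is not TerMine's actual scoring method.

```python
from collections import Counter


def candidate_terms(tagged_sentence, max_len=4):
    """Yield multi-word runs of ADJ/NOUN tokens that end in a NOUN."""
    run = []
    # A sentinel flushes the final run at the end of the sentence.
    for tok, pos in list(tagged_sentence) + [("", "END")]:
        if pos in ("ADJ", "NOUN"):
            run.append((tok.lower(), pos))
            continue
        # Drop trailing adjectives so the candidate ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if 2 <= len(run) <= max_len:
            yield " ".join(t for t, _ in run)
        run = []


def rank_terms(tagged_sentences, min_freq=2):
    """Count candidates across sentences and keep the frequent ones."""
    counts = Counter(
        term for sent in tagged_sentences for term in candidate_terms(sent)
    )
    return [(term, n) for term, n in counts.most_common() if n >= min_freq]


if __name__ == "__main__":
    # Hypothetical tagger output: lists of (token, tag) pairs.
    sentences = [
        [("The", "DET"), ("nuclear", "ADJ"), ("receptor", "NOUN"),
         ("binds", "VERB"), ("DNA", "NOUN")],
        [("A", "DET"), ("nuclear", "ADJ"), ("receptor", "NOUN"),
         ("is", "VERB"), ("active", "ADJ")],
        [("Text", "NOUN"), ("mining", "NOUN"), ("helps", "VERB")],
    ]
    print(rank_terms(sentences))
```

Because the only input is tagged text, the same code applies unchanged to any domain or genre for which a general-purpose tagger is available.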
Argumentative Zoning Simone Teufel, Cambridge Computing Lab
–BKG: General scientific background (yellow)
–OTH: Neutral descriptions of others' work (orange)
–OWN: Neutral descriptions of own, new work (blue)
–AIM: Statements of the particular aim of the current paper (pink)
–TXT: Statements of the textual organisation of the current paper (red)
–CTR: Contrastive or comparative statements, including explicit mention of weaknesses of other work (green)
–BAS: Statements that own work is based on other work (purple)
End-user services based on full NLP and conceptual indexing Two conceptual IR services based on prior full-scale NLP analysis of Medline at Tsujii Lab, University of Tokyo –InfoPubMed: A complex tool supporting a research workflow for literature review and knowledge discovery/hypothesis generation –Medie: A simple IR interface as intuitive as Google, but returning fact-bearing sentences, which are more than document surrogates.
Possible end-user service based on AZ More than Google's PageRank, because the links are typed.
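One way to picture what typed links add is a citation graph whose edges carry a label. The sketch below assumes, purely for illustration, that the BAS, CTR and OTH zones are used as link types; the paper identifiers and function names are hypothetical. It does not implement PageRank. It only shows that a typed graph can answer questions an untyped link count cannot, such as which papers build on a given paper rather than criticise it.

```python
from collections import Counter, defaultdict
from enum import Enum


class LinkType(Enum):
    """Assumed link labels, borrowed from three of the AZ categories."""
    BAS = "based on"
    CTR = "contrast or criticism"
    OTH = "neutral mention"


def typed_in_links(citations):
    """Count incoming citations per cited paper, broken down by type."""
    profile = defaultdict(Counter)
    for _citing, cited, link_type in citations:
        profile[cited][link_type] += 1
    return profile


def papers_building_on(citations, target):
    """Return the papers that cite `target` as the basis of their work."""
    return sorted(
        citing for citing, cited, link_type in citations
        if cited == target and link_type is LinkType.BAS
    )


if __name__ == "__main__":
    # Hypothetical citation triples: (citing paper, cited paper, type).
    citations = [
        ("P2", "P1", LinkType.BAS),
        ("P3", "P1", LinkType.CTR),
        ("P4", "P1", LinkType.BAS),
        ("P4", "P2", LinkType.OTH),
    ]
    profile = typed_in_links(citations)
    print(sum(profile["P1"].values()))       # untyped view: total in-links
    print(dict(profile["P1"]))               # typed view: counts per type
    print(papers_building_on(citations, "P1"))
```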
NaCTeM and Social Science/Humanities In Year 3 (from Oct 2006), develop pilot service aimed at social science. Local links with NCESS Preparatory invited workshop held in May, 2006. Text-mining and Digitised C19th Research Resources Workshop with British Library
Workshop on Text Mining in Social Sciences Presentations available at NaCTeM Web page
–Bridging qualitative and quantitative methods for social sciences using text mining techniques (Sophia Ananiadou)
–Text Mining Activities at the National Centre (Sophia Ananiadou, Jun-ichi Tsujii, Paul Watry)
–Smart Qualitative Data: Methods and Community Tools for Data Mark-Up (SQUAD) (Louise Corti)
–Author Identification (Katerina T. Frantzi)
–Sentiment Analysis and Financial Grids (Lee Gillam)
–Concordances and semi-automatic coding in qualitative analysis: possibilities and barriers (Graham R. Gibbs)
–Bridging quantitative and qualitative methods for social sciences using text mining techniques (Tetsuya Nasukawa)
–Computer-Assisted Content Analysis (Andrew Wilson)
NaCTeM status NaCTeM is almost at the end of its tool development phase Moving to deployment of services this Autumn Will include domain-independent terminology management from the outset Other applications of interest to social science researchers will be appearing approx. 1 year from now.