National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director www.nactem.ac.uk.

Slides:



Advertisements
Similar presentations
GMD German National Research Center for Information Technology Darmstadt University of Technology Perspectives and Priorities for Digital Libraries Research.
Advertisements

RSP Summer School14-16 September 2009 UK Institutional Repository Search: a collaborative project to showcase UK research output through advanced discovery.
National Centre for Text Mining John Keane NaCTeM Co-director University of Manchester.
Supporting the Research Process The NaCTeM Text Mining Service William Black Informatics, Manchester.
DELOS Highlights COSTANTINO THANOS ITALIAN NATIONAL RESEARCH COUNCIL.
A centre of expertise in digital information management Tools for the Trade? Supporting Multidisciplinary Research Dr Liz Lyon, Director.
Distributed search for complex heterogeneous media Werner Bailer, José-Manuel López-Cobo, Guillermo Álvaro, Georg Thallinger Search Computing Workshop.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
13 th September 2007 UK e-Science All Hands Meeting Text Mining Services to Support e-Research Brian Rea and Sophia Ananiadou National Centre for Text.
SEVENPRO – STREP KEG seminar, Prague, 8/November/2007 © SEVENPRO Consortium SEVENPRO – Semantic Virtual Engineering Environment for Product.
Automatic indexing and retrieval of crime-scene photographs Katerina Pastra, Horacio Saggion, Yorick Wilks NLP group, University of Sheffield Scene of.
Supporting Further and Higher Education Building the UK National Information Environment - Lessons from the Past and Pointers To the Future Norman Wiseman.
1. UKPMC ‘We exist for everyone who wants to do research – for academic, personal, or commercial purposes.’ - BL Strategy 2005/8.
1 Enriching UK PubMed Central SPIDER launch meeting, Wolfson College, Oxford Paul Davey, UK PubMed Central Engagement Manager.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,
A Flexible Workbench for Document Analysis and Text Mining NLDB’2004, Salford, June Gulla, Brasethvik and Kaada A Flexible Workbench for Document.
Shared Ontology for Knowledge Management Atanas Kiryakov, Borislav Popov, Ilian Kitchukov, and Krasimir Angelov Meher Shaikh.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
BioText Infrastructure Ariel Schwartz Gaurav Bhalotia 10/07/2002.
Mike Smorul Saurabh Channan Digital Preservation and Archiving at the Institute for Advanced Computer Studies University of Maryland, College Park.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
GL12 Conf. Dec. 6-7, 2010NTL, Prague, Czech Republic Extending the “Facets” concept by applying NLP tools to catalog records of scientific literature *E.
Špindlerův Mlýn, Czech Republic, SOFSEM Semantically-aided Data-aware Service Workflow Composition Ondrej Habala, Marek Paralič,
Moving forward our shared data agenda: a view from the publishing industry ICSTI, March 2012.
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Key integrating concepts Groups Formal Community Groups Ad-hoc special purpose/ interest groups Fine-grained access control and membership Linked All content.
Enriching the Ontology for Biomedical Investigations (OBI) to Improve Its Suitability for Web Service Annotations Chaitanya Guttula, Alok Dhamanaskar,
Universität Stuttgart Universitätsbibliothek Information Retrieval on the Grid? Results and suggestions from Project GRACE Werner Stephan Stuttgart University.
Distributed Access to Data Resources: Metadata Experiences from the NESSTAR Project Simon Musgrave Data Archive, University of Essex.
Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.
1 Common Challenges Across Scientific Disciplines Laurence Field CERN 18 th November 2013.
National Centre for Text Mining NaCTeM e-science and data mining workshop John Keane Co-Director, NaCTeM School of Informatics,
Text Mining: Opportunities and Barriers John McNaught Deputy Director National Centre for Text Mining
University of Dublin Trinity College Localisation and Personalisation: Dynamic Retrieval & Adaptation of Multi-lingual Multimedia Content Prof Vincent.
Using the Open Metadata Registry (openMDR) to create Data Sharing Interfaces October 14 th, 2010 David Ervin & Rakesh Dhaval, Center for IT Innovations.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Flexible Text Mining using Interactive Information Extraction David Milward
PLoS ONE Application Journal Publishing System (JPS) First application built on Topaz application framework Web 2.0 –Uses a template engine to display.
EU Project proposal. Andrei S. Lopatenko 1 EU Project Proposal CERIF-SW Andrei S. Lopatenko Vienna University of Technology
DataNet – Flexible Metadata Overlay over File Resources Daniel Harężlak 1, Marek Kasztelnik 1, Maciej Pawlik 1, Bartosz Wilk 1, Marian Bubak 1,2 1 ACC.
SEEK Welcome Malcolm Atkinson Director 12 th May 2004.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
Personalized Interaction With Semantic Information Portals Eric Schwarzkopf DFKI
ICCS WSES BOF Discussion. Possible Topics Scientific workflows and Grid infrastructure Utilization of computing resources in scientific workflows; Virtual.
SKOS. Ontologies Metadata –Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies –Provide.
ACGT: Open Grid Services for Improving Medical Knowledge Discovery Stelios G. Sfakianakis, FORTH.
CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.
ESFRI & e-Infrastructure Collaborations, EGEE’09 Krzysztof Wrona September 21 st, 2009 European XFEL.
Cooperative experiments in VL-e: from scientific workflows to knowledge sharing Z.Zhao (1) V. Guevara( 1) A. Wibisono(1) A. Belloum(1) M. Bubak(1,2) B.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
CASE (Computer-Aided Software Engineering) Tools Software that is used to support software process activities. Provides software process support by:- –
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
UKOLN is supported by: Introduction to UKOLN Dr Liz Lyon, Director UKOLN, University of Bath, UK Grand Challenge Meeting, June a centre.
26/05/2005 Research Infrastructures - 'eInfrastructure: Grid initiatives‘ FP INFRASTRUCTURES-71 DIMMI Project a DI gital M ulti M edia I nfrastructure.
EBI is an Outstation of the European Molecular Biology Laboratory. Literature Resources at the EBI Information Workshop on European Bioinformatics Resources.
Text Mining and Knowledge Management Junichi Tsujii GENIA Project, Kototoi Project ( tokyo.ac.jp/GENIA/) Computer Science, University.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
1 e-Arts and Humanities Scoping an e-Science Agenda Sheila Anderson Arts and Humanities Data Service Arts and Humanities e-Science Support Centre King’s.
Collection-level description: from theory to practice Minerva project meeting Paris, 24 January 2003 Pete Johnston UKOLN, University of Bath Bath, BA2.
Fedora Commons Overview and Background Sandy Payette, Executive Director UK Fedora Training London January 22-23, 2009.
Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio.
INTAROS WP5 Data integration and management
CCNT Lab of Zhejiang University
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
TDM=Text Mining “automated processing of large amounts of structured digital textual content for purposes of information retrieval, extraction, interpretation.
Web archives as a research subject
Presentation transcript:

National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Outline Overview of NaCTeM, role in e- Infrastructure Quick tour of NaCTeM services/tools Some challenges for text mining in biology

UK National Centre for Text Mining (NaCTeM) ‏ 1 st national text mining centre in the world Location: Manchester Interdisciplinary Biocentre (MIB) Remit: Provision of text mining services to support UK research Funded by: the JISC, BBSRC, EPSRC Initial focus: Biology

Why is there a need in the UK for a national centre for text mining? Some researchers knew they wanted TM TM key component of e-Science UK policy to involve more researchers (from all domains) in doing e-science and e-research TM seen as key technology for researchers And one applicable in every domain (broad interest/support from major funding bodies)‏

Embedding Text Mining within e-Science in the UK e-Science […] enables new research and increases productivity through shared e ‑ Infrastructure, the development of computational and logical models and new ways to discover and use the growing range of distributed and interoperable resources. It supports multidisciplinary and collaborative working and a culture that adopts the emerging methods. M. Atkinson (2007) Beyond e-Science

e-Research Infrastructure Access to information, data resources, distributed computing essential for bio- scientists e-Infrastructure provides services and facilities enabling advanced research Deploying e-Infrastructure increases the pace and efficiency of new research methods and techniques

E-Infrastructure and Text Mining: for whom? End-users Software engineers Generic tool developers Service and resource providers Workflow developers Text miners BLAST, Swiss-Prot,

What users want to do with their data (minimally) Easier access to data Share data with their peers Annotate data with metadata Manage data across locations Integrate data within workflows, Web Services Aids for semantic metadata creation; enriching data with related metadata e.g. experimental results TEXT MINING CAN SUPPORT USERS

METABOLIC MODEL IN SBML CREATE MODEL VISUALISE STORE MODEL IN DB RUN BASE MODEL SENSITIVITY ANALYSES SCAN PARAMETER SPACE COMPARE WITH ‘REAL’ DATA DIFFERENT METABOLIC MODEL IN SBML TEXT MINING ANNOTATE STORE NEW MODEL IN DB COMPARE MODELS STORE DIFFERENCES AS NEW MODEL IN DB SYSTEMS BIOLOGY WORKFLOWS – D.B.KELL, MCISB

Science is data-driven “the current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science” Peter Murray-Rust, Data-driven science

What do we provide and build? Resources: lexica, terminologies, thesauri, grammars, annotated corpora Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers Services Proof of concept: evolving to large scale Grid-enabled services Service provision through Customised solutions

A complex problem TM involves Many components (converters, analysers, miners, visualisers,...)‏ Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs)‏ Many combinations of components and resources for different applications Many different user requirements and scenarios, training needs Need to be active in all areas to effectively support researchers

Development strategy Re-use tools where possible Forge strategic relationships (UTokyo, IBM)‏ Use integration framework (UIMA)‏ Develop generic TM tools Customise for specific domains/scenarios (MCISB, pharmas)‏ While actively engaging with user communities (requirements, evaluation)‏ Encapsulate in services

How do we provide services? Modes of use Demonstrators: for small-scale online use Batch mode: upload data, get with link to download site when job done Web Services Embedding text mining Web Services into Workflows Some services are compositions of tools Individual tools to process user data

Services based on pre-processed collections Pre-process (analyse and index) popular collections MEDLINE Evolving to handle full text (UKPMC)‏ Provide advanced search interfaces to these Based on user scenarios Rapid results for end-user Regular up-dating of analyses carried out

NaCTeM services and tools TerMine AcroMine MEDIE InfoPubMed KLEIO FACTA

TerMine C-value is a hybrid technique; extracts multi- word terms; language independent Combines linguistic filters and statistics total frequency of occurrence of string in corpus frequency of string as part of longer candidate terms (nested terms)‏ number of these longer candidate terms length of string (in number of words)‏ Frantzi, K., Ananiadou, S. & Mima, H. (2000) International Journal of Digital Libraries 3(2)‏

The importance of acronym recognition Acronyms are among the most productive type of term variation Acronyms are used more frequently than full terms 5,477 documents could be retrieved by using the acronym JNK while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase [Wren et al. 05] No rules or exact patterns for the creation of acronyms from their full form

Top 20 acronyms in MEDLINE

MEDIE An interactive intelligent IR system retrieving events Performs a semantic search System components GENIA tagger Enju (HPSG parser)‏ Dictionary-based named entity recognition

Semantic structure CRP measurement does NP 13 VP 17 VP 15 SoSo NP 1 DT 2 A VP 16 AV 19 not VP 21 VP 22 exclude NP 25 NP 24 AJ 26 deep NP 28 NP 29 vein NP 31 thrombosis serum NP 4 AJ 5 normal NP 7 NP 8 NP 10 NP 11 ARG1 MOD ARG1 ARG2 ARG1 ARG2 ARG1 MOD Predicate argument relations

Abstraction of surface expressions

Info-PubMed An interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins,and the interactions between them. System components MEDIE Extraction of protein-protein interactions Multi-window interface on a browser

Info-PubMed

KLEIO Semantically enriched information retrieval system for biology Offers textual and metadata searches across MEDLINE Provides enhanced searching functionality by leveraging terminology management technologies

KLEIO architecture

32 different ABC acronyms in MEDLINE...

Exploit named entity recognition to focus search

See detail, link to BioDBs and other resources

Refine query to human

FACTA: finding associated concepts

Nicotine and AD

Challenges for TM in Biology Terminology, terminology, terminology Named entities: same name, different species Linking forms with concepts Event extraction Corpus annotation (who does it with what accuracy and can they agree)‏ (scientific argumentation, opinion mining)‏

Tackling terminology: BOOTStrep EC collaborative project: Resource building for TM Reusable lexical, terminological, conceptual resources for biological domain (gene regulation)‏ Tool-based methodology to create and maintain resources as fully automatically as possible Process BioDBs for terms then find variants in text Relations (including facts) among entities extracted by side-effect of text analysis Raw fact store built up as process texts for resource building

BOOTStrep BioLexicon TypeEntriesVariants cell8421,400 chemicals19,637106,302 enzymes4,01611,674 diseases19,45733,161 genes/proteins1,640,6083,048,920 GO concepts25,21981,642 molecular role concepts8,85060,408 operons2,6723,145 protein complexes2,1042,647 protein domains16,84033,880 SO concepts1,4312,326 species482,992669,481 Transcr. factors160795

BOOTStrep BioLexicon Also contains Terminological verbs (759 base, 4,556 inflected forms)‏ Terminological adjectives (1,258)‏ Terminological adverbs (130)‏ Nominalized verbs (1,771)‏ Verbs have syntactic and semantic frames (entities they occur with)‏ allows finer-grained analysis All items have associated linguistic info

Challenge: Complex analysis currently requires highly customised solutions

Challenge: Dealing with full text Need to be able to handle very large amounts of text Other issues besides linguistic/NLP ones (already hard)‏ Efficiency, scalability, distributed processing Porting TM tools to UK and European Grid environment

Need for processing full texts Will allow researchers to discover hidden relationships from text that were not known before an abstract’s length is on average 3% of the entire article an abstract includes only 20% of the useful information that can be learned from text

Parallelising TM TM applications are data independent Scale linearly in an ideal world HPC implementation Scaled linearly to 100 processors Porting to DEISA to scale over 1000s processors to process TBs of data in reasonable time

TM of full texts for UK PubMed Central (UKPMC) ‏ Free archive of life sciences journals British Library, European Bioinformatics Institute & UManchester (NaCTeM, Mimas)‏ Phase 3 tasks: integration of UKPMC in biomedical DB infrastructure with TM solutions for improved search and knowledge discovery

NaCTeM in UKPMC TM “behind the scenes” on full texts Named entity recognition Link entities in texts to bioDB entries Fact extraction E.g., protein-protein interactions Add extracted info as semantic metadata Index for efficient access Semantic search capability Based on user needs, evaluation workshops

Challenge: Tool interoperability A plethora of NLP tools out there... Part-of-speech tagging: TnT tagger, Tree tagger, SVM Tool, GENIA… Chunking: YamCha, GENIA, … Named entity recognition: LingPipe, ABNER, Stanford NE Recognizer, GENIA, … Syntactic parsing: Enju, OpenNLP, Charniak parser, Collins parser… Semantic role labeling Shalmaneser Co-reference resolution...

So where is the problem? Increasing contents of annotation Increasing layers of annotation More applications of annotation Various annotation schemes tokensphrasesentities Predicate-arguments morphology syntaxsemanticsRDF Linguistics NLP Text Mining Semantic web TEIPTBXCESSciXML ?

Architecture for interoperability A good architecture should Accommodate various annotation schemas, formats of NLP tools. Provide a common exchange annotation schema. Facilitate easy communication between NLP tools. Extend to incorporate new tools Exchange schema Term extractor POS tagger Chunker Parser + tools UIMA PLATFORM

Uses of our tools and services Searching Metadata creation Controlled vocabularies Ontology building Data integration Linking repositories Database curation Reviewing Gene – disease mining Enriching pathway models Indexing Document classification …

NaCTeM phase II ( ) ‏ TM supporting service provision Web Services Embedding TM within workflows Adaptive learning Integration of data / text mining Issues Full paper processing open access collections IPR in data derived via text mining Interoperability Education and training

NaCTeM phase II ( ) ‏ Service exemplars Intelligent semantic searching for construction of biological networks Support for qualitative data analysis for social sciences Intelligent semantic search of gene-disease associations for health e-Research and e-Science Knowledge discovery Collaborative research E-publishing Personalised searching

Research feeding into services or adding to knowledge REFINE (BBSRC): enriching SBML models and integration of TM with visualisation ONDEX (BBSRC): aiding data integration Disease association mining (Pfizer, NHS)‏

ONDEX (BBSRC ) ‏

Acknowledgments Text Mining Team: 14 members NaCTeM funding agencies: Wellcome Trust Close collaboration with University of Tokyo

Acknowledgements User group Systems Biology Centres Middleware provider Usability and evaluation Service provision MIMAS

Further reading Visit out site at for TM briefing paper and other publications on our workwww.nactem.ac.uk Book: Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Norwood, MA: Artech House.