13th September 2007 UK e-Science All Hands Meeting Text Mining Services to Support e-Research Brian Rea and Sophia Ananiadou National Centre for Text.

Presentation transcript:

13th September 2007, UK e-Science All Hands Meeting
Text Mining Services to Support e-Research
Brian Rea and Sophia Ananiadou
National Centre for Text Mining, University of Manchester

Outline
Recent History
Document Enrichment
Information Retrieval and Text Mining
Text Mining Applications
Case Study: ASSERT Project
Case Study: BBC Online News Feeds
Future Opportunities

What is Text Mining?
Text mining discovers and extracts information hidden in unstructured texts.
It aids the construction of hypotheses based upon associations between the extracted information.
Because of this, it can often discover things overlooked by human readers.

What Text Mining is not
Text mining is not based upon an understanding of document content.
Instead it predicts the most likely meaning of a fragment of text based upon models of language.
Text mining will generally not pick up on sarcasm, irony or other subtleties of language usage.
Text mining tools must be tuned before use on different text types, styles or languages.

Recent History
Number of MEDLINE references    14,792,890
Number of abstracts             7,434,879
Number of sentences             70,815,480
Number of words                 1,418,949,650
Uncompressed data size          10GB
Compressed data size            3.2GB
Suppose, for example, that it takes one second to analyse one sentence: 70 million seconds, that is, more than 2 years.
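The estimate on this slide can be checked with a few lines of arithmetic. The sketch below uses the sentence count from the slide; the function name and the `workers` parameter are illustrative additions, included only to show how the figure changes if the work were split evenly across several machines.

```python
SENTENCES = 70_815_480               # number of sentences, taken from the slide
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_to_process(sentences, seconds_per_sentence=1.0, workers=1):
    """Wall-clock years needed, assuming the work divides evenly across workers."""
    return sentences * seconds_per_sentence / workers / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"one machine: {years_to_process(SENTENCES):.2f} years")
    print(f"100 workers: {years_to_process(SENTENCES, workers=100) * 365:.1f} days")
```

The even split is an idealisation; it ignores scheduling and data-transfer overheads.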

Rapid increase in the amount of literature means it is becoming impractical to read everything in many disciplines.
Text mining systems can begin to address this by automating some of the process.
Without any inherent understanding of language the system must use different methods to that of a human.
As such it can often discover facts or patterns that a human may easily miss.

Document Enhancement
How do we approximate an understanding of natural language?
Levels of annotation are built up in stages.
Tokenisation gives us words and boundaries.
Part-Of-Speech (POS) Tagging gives us a basic model with nouns, verbs, etc.
–There are many methods for predicting POS
–Training on hand-coded documents is necessary to improve accuracy
–Errors at this early stage can grow exponentially through the system
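The first two annotation stages can be sketched as follows. This is a toy illustration, not the tagger used by the authors: the regular expression, the `LEXICON` table and the fall-back to `NOUN` are all assumptions made for the example, whereas real taggers are trained on hand-annotated text.

```python
import re

# Hypothetical hand-written lexicon standing in for a trained tagging model.
LEXICON = {
    "the": "DET", "mps": "NOUN", "discussed": "VERB",
    "policy": "NOUN", "with": "PREP", "ambassador": "NOUN",
}


def tokenise(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Attach a POS tag to each token; unknown words default to NOUN."""
    tagged = []
    for token in tokens:
        if not re.fullmatch(r"\w+", token):
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, LEXICON.get(token.lower(), "NOUN")))
    return tagged


if __name__ == "__main__":
    print(tag(tokenise("The MPs discussed the policy.")))
```

Because later stages consume these tags, a wrong tag here propagates into every annotation built on top of it.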

Document Enhancement
How do we fit these words together?
Grammars provide simple syntax rules for building up complex sentences based upon POS tag information.
Shallow Parsing – gives information about noun and verb phrases
Deep Parsing – generates complex representations of the underlying relationships between phrases
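As a rough illustration of shallow parsing, the sketch below groups POS-tagged tokens into noun phrases using a single hand-written rule (optional determiners and adjectives followed by one or more nouns). The rule, the tag names and the `TAGGED` sample are assumptions for the example; production chunkers use much richer grammars or trained models.

```python
# Sample input: (token, POS tag) pairs, as a tagger might produce them.
TAGGED = [
    ("The", "DET"), ("MPs", "NOUN"), ("discussed", "VERB"),
    ("the", "DET"), ("policy", "NOUN"), ("with", "PREP"),
    ("the", "DET"), ("ambassador", "NOUN"),
]


def noun_phrases(tagged):
    """Return chunks matching the pattern (DET|ADJ)* NOUN+."""
    chunks = []
    current = []        # tokens of the chunk being built
    seen_noun = False   # True once the chunk contains its noun head
    for token, pos in list(tagged) + [("", "END")]:
        if pos == "NOUN":
            current.append(token)
            seen_noun = True
        elif pos in ("DET", "ADJ") and not seen_noun:
            current.append(token)   # still collecting pre-modifiers
        else:
            if seen_noun:
                chunks.append(" ".join(current))
            # A determiner or adjective here opens the next chunk.
            current = [token] if pos in ("DET", "ADJ") else []
            seen_noun = False
    return chunks


if __name__ == "__main__":
    print(noun_phrases(TAGGED))
```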

Document Enhancement
Example: “The MPs discussed the policy with the ambassador”
This is a relatively simple example, but there are many ways of interpreting it.
Parsing techniques choose the most likely meaning based upon complex internal models.
Complex sentences can take longer to process as many possibilities are available and need to be ruled out.

MedIE

Term Discovery
Keywords are often used when searching within documents.
This reduces the noise created by common words that carry little information.
Text mining can take this a stage further and identify significant terms (multi-word units).
Terms can be used to:
–Gain an overview of the document contents
–Assist searching by allowing query expansion and browsing
–Identify important concepts for generating ontologies
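A very simple way to surface multi-word candidates is to count word sequences that contain no stopwords. The sketch below does only that; it is a frequency count under assumed settings (the `STOPWORDS` set and the 2 to 3 word window are illustrative), not the method behind any particular term-extraction service.

```python
import re
from collections import Counter

# Small illustrative stopword list; real systems use far larger ones.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "for",
             "to", "with", "by", "on", "can", "be", "part"}


def candidate_terms(text, max_len=3):
    """Count 2..max_len word sequences that contain no stopwords."""
    counts = Counter()
    # Punctuation (other than hyphens) ends a candidate sequence.
    for segment in re.split(r"[^\w\s-]+", text.lower()):
        run = []
        for word in segment.split() + [None]:   # None flushes the last run
            if word is None or word in STOPWORDS:
                for n in range(2, max_len + 1):
                    for i in range(len(run) - n + 1):
                        counts[" ".join(run[i:i + n])] += 1
                run = []
            else:
                run.append(word)
    return counts


if __name__ == "__main__":
    sample = ("Text mining supports named entity recognition. "
              "Named entity recognition is part of text mining.")
    print(candidate_terms(sample).most_common(3))
```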

TerMine

Named Entity Recognition
Uses techniques to find common forms or patterns in text to identify items belonging to particular semantic categories.
Different methods can be used, including rule-based, template-driven or machine learning.
Some examples include: names, addresses, organisations, dates, times, quantities…
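The rule-based approach can be illustrated with a handful of regular expressions, one per semantic category. The patterns below are assumptions chosen for the example and cover only a narrow set of date, time and quantity formats.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical hand-written rules, one per entity category.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}(?:st|nd|rd|th)? (?:" + MONTHS + r") \d{4}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
    "QUANTITY": re.compile(r"\b\d+(?:\.\d+)?\s?(?:GB|MB|kg|km)\b"),
}


def find_entities(text):
    """Return (category, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


if __name__ == "__main__":
    print(find_entities("The talk on 13th September 2007 started at 10:30 "
                        "and covered 3.2GB of data."))
```

Rules like these are easy to read and debug but must be rewritten for each new text type, which is one reason machine-learning methods are also used.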

SemText

Document Similarity
One of the most common models of document similarity is the Vector Space model.
Each document is represented in a multi-dimensional space where each term acts as a dimension.
The distance in that dimension is represented by the contribution or strength of that term in the given document.
Similarity can be calculated using the cosine of the angle between the two vectors.
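A minimal sketch of the vector space model follows. It assumes raw term counts as the "strength" of each term and whitespace tokenisation; weighting schemes such as TF-IDF are common in practice but are not shown here.

```python
import math
from collections import Counter


def vectorise(text):
    """Represent a document as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    d1 = vectorise("text mining services for research")
    d2 = vectorise("text mining for biology")
    print(round(cosine(d1, d2), 3))
```

Identical documents score 1, documents with no shared terms score 0, and the measure is unaffected by document length because the vectors are normalised.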

Dimensionality Reduction
Due to computational limitations it is impractical to search the entire document space.
Where possible we can reduce the space by mapping all synonyms of a term to a single label.
For larger scale reduction we can use Latent Semantic Indexing, which merges terms that regularly co-occur, with remarkably good results.
Benefits of this include noise reduction and removal of redundant terms.
Drawbacks include the expensive matrix operations involved to generate the mapping rules.
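The simpler of the two reductions, mapping synonyms to a single label, can be sketched in a few lines. The `SYNONYMS` table is a hypothetical stand-in for a thesaurus or ontology; Latent Semantic Indexing itself, which relies on a matrix decomposition, is not shown.

```python
from collections import Counter

# Hypothetical synonym table; a real system would draw on a thesaurus.
SYNONYMS = {"automobile": "car", "auto": "car", "motorcar": "car"}


def reduce_terms(tokens):
    """Count terms after mapping each synonym to its canonical label."""
    return Counter(SYNONYMS.get(t.lower(), t.lower()) for t in tokens)


if __name__ == "__main__":
    tokens = ["Car", "automobile", "auto", "engine"]
    vector = reduce_terms(tokens)
    print(len(set(t.lower() for t in tokens)), "dimensions before,",
          len(vector), "after")
```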

Online or Offline Processing
Many of the techniques introduced so far can be processed at any time, not just at run time.
This allows us to handle the major bulk of processing well in advance of our services becoming available.
For larger document collections the scale of this processing makes it impractical for a single machine.
We are currently in the process of preparing our tools to allow use on the national computing resources, i.e. HPC and Grid.

Associative Search
Relies upon the vector space similarity to identify a set of documents with related content to a target collection.
Single document targets are treated like a normal query.
Multiple document targets involve extra effort to identify the related set of terms that best represents the collection.
This process not only identifies similar documents but may also recognise previously unknown yet related areas.
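One way to handle a multi-document target is to sum the term vectors of the target documents and use the result as the query. The sketch below takes that approach as an assumption; the slide does not say how the representative set of terms is actually chosen, and the function names are illustrative.

```python
import math
from collections import Counter


def vectorise(text):
    """Sparse term-count vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def associative_search(targets, collection, top_n=3):
    """Rank collection documents by similarity to a set of target documents."""
    query = Counter()
    for target in targets:
        query.update(vectorise(target))   # summed vector; cosine ignores scale
    scored = [(cosine(query, vectorise(doc)), doc) for doc in collection]
    ranked = sorted(scored, reverse=True)[:top_n]
    return [doc for score, doc in ranked if score > 0]


if __name__ == "__main__":
    docs = ["grid computing for text mining",
            "football results",
            "text mining of medline abstracts"]
    print(associative_search(["text mining services", "mining medline text"], docs))
```

A single-document target is just the special case where `targets` has one element.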

Document Similarity

Information Extraction
IE brings together term discovery, pattern matching and named entity recognition to identify and extract facts.
We define the form of the information we are interested in as fact templates.
Each template has attribute slots that can be filled by named entities or other facts.
Example: Person_X is a programme manager for Programme_Y with JISC.
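The example template on this slide can be approximated with a single pattern whose named groups act as the attribute slots. This is a simplified sketch: the regular expression and the sample sentence (including the person's name) are made up for illustration, and a real system would fill slots from named entity recognition output instead of matching raw text.

```python
import re

# One fact template: "<person> is a programme manager for <programme> with <organisation>".
PROGRAMME_MANAGER = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*) is a programme manager for "
    r"(?P<programme>.+?) with (?P<organisation>[A-Z]\w+)"
)


def extract_facts(text):
    """Return one dict of filled slots per match of the template."""
    return [m.groupdict() for m in PROGRAMME_MANAGER.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sentence; the name is invented for the example.
    sentence = ("Jane Smith is a programme manager for "
                "the e-Research Programme with JISC.")
    print(extract_facts(sentence))
```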

InfoPubMed

Document Summarisation
Two main methods of manual summarisation:
–Abstractive – relies upon an understanding of the content to rewrite a new version in a shorter form
–Extractive – draws upon key sections to form a readable shorter form
Preserve the important informative content
Reduce redundancy through knowledge of terms and synonyms
It is much harder across multiple documents
Potentially important to link back to key evidence
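An extractive summariser can be sketched by scoring each sentence on how frequent its words are in the whole document and keeping the top scorers in their original order. The scoring rule and the `STOPWORDS` list are assumptions for this example; they are one simple heuristic among many.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "from", "to"}


def content_words(sentence):
    """Lower-cased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def summarise(text, max_sentences=1):
    """Pick the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]   # keep original order


if __name__ == "__main__":
    doc = ("Text mining extracts facts from text. The weather was pleasant. "
           "Text mining supports researchers.")
    print(summarise(doc, max_sentences=2))
```

Because the output is made of sentences copied from the source, each one can be linked back to its position in the original document.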

Case Study: ASSERT
Automatic Summarisation for Systematic Reviews using Text Mining
[Diagram labels: Search, Screen, Synthesize; Document Collections, Query Expansion, Document Classification, Document Clustering, Document Sectioning, Term Extraction, Sentence Extraction, Multi-Document Summarisation]

Case Study: BBC News Feeds
Analyse, structure and visualise BBC news online, according to a user’s query, using advanced text mining techniques
Concept discovery and retrieval
–Interface allows a user to enter a query across the document collection and automatically calculates a list of concepts specific to the query, ranked by perceived importance.
Creation of user-oriented knowledge maps
–Based on clusters of articles and their automatic concept categorisation.

Future Developments
Ongoing development of key text mining services to support the UK academic community
Further application of HPC and Grid technology
–processing for document enhancement
–handling data and processing for intermediary results
–responsive and efficient service implementations
Transformation of components to web services and integration with workflow solutions
Investigation into interoperability issues between components and intermediary formats

Conclusions
NaCTeM has made strong progress in
–Provision of core text mining services and support
–Leveraging strengths in BioSciences out to social sciences, arts and humanities
Text Mining is integral to UK infrastructure for eResearch, but requires closer integration into existing research methodology and practice
Links with infrastructure are essential to support scalable solutions for future challenges
Interoperability between tools and formats is necessary for true flexibility between text mining components
IPR issues and policy require further investigation

How to contact us
Visit the Text Mining Centre Website at