Mine your data: contrasting data mining approaches to numeric and textual data sources IASSIST May 2006 conference Ann Arbor, USA Louise Corti UK Data.


Mine your data: contrasting data mining approaches to numeric and textual data sources IASSIST May 2006 conference Ann Arbor, USA Louise Corti UK Data Archive Karsten Boye Rasmussen Department of Marketing & Management University of Southern Denmark Campusvej 55, DK-5230 Odense M.

Data and text mining. Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. Typically used in domains with structured data, e.g. customer relationship management in banking and retail. Text mining: extracting knowledge that is hidden in text to present distilled knowledge to users in a concise form. Can collect, maintain, interpret, curate and discover knowledge.

Data Mining. Data mining originated in the 1990s as Knowledge Discovery in Databases (KDD). "World of networked knowledge". Directed data mining: a variable (the target) is explained through a model.

Model & Meaning. "Meaning" may be regarded as an approximate synonym of pattern, redundancy, information, and "restraint". Knowing something: "It is possible to make a better than random guess" (Bateson).

Regression: visualization of the model. Used Nissan cars of the same type: price, kilometers driven, year, color, paint, rust, bumps, non-smoking, leather, etc.

Regression: model. Linear: $Y = \alpha + \beta_1 X_1$; with more independent variables: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2$. Logistic: $\operatorname{logit}(P) = \log\bigl(P/(1-P)\bigr) = \alpha + \beta_1 X_1$, so $P = \exp(\alpha + \beta_1 X_1) / \bigl(1 + \exp(\alpha + \beta_1 X_1)\bigr)$. Quadratic, etc.
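The linear and logistic models on this slide can be evaluated in a few lines. The following is a minimal sketch in Python; the coefficient values are hypothetical and chosen only for illustration, they do not come from the presentation.

```python
import math

def linear(alpha, beta1, x1):
    """Linear predictor: Y = alpha + beta1 * x1."""
    return alpha + beta1 * x1

def logistic(alpha, beta1, x1):
    """P = exp(alpha + beta1*x1) / (1 + exp(alpha + beta1*x1))."""
    z = linear(alpha, beta1, x1)
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """logit(P) = log(P / (1 - P)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (assumptions, not values from the slides).
alpha, beta1 = -1.0, 0.5
p = logistic(alpha, beta1, x1=2.0)  # linear predictor is 0 here, so p is 0.5
```

Applying `logit` to the output of `logistic` recovers the linear predictor, which is the sense in which the logistic model is linear on the logit scale.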

The target & the problem. Context: selling via mail or e-mail or phone or ..., directed towards a person. We know the previous customers (potential customers) and which of these bought our target. Problem: we have 390 sofas to sell!

Lots of other models - and lots of data. Split up the huge dataset: training data, validation data, testing data.

Lots of data. Split up the huge dataset, randomly distributed: training data, validation data, testing data. Target.
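The random split into training, validation and testing data could look roughly like this. It is a sketch only: the 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions for illustration, not choices stated in the presentation.

```python
import random

def split_dataset(rows, train=0.6, valid=0.2, seed=42):
    """Randomly split rows into training, validation and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # random distribution of records
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    testing = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation, testing

# Stand-in for the huge dataset: 1000 record ids.
training, validation, testing = split_dataset(range(1000))
```

The model is fitted on the training part, tuned on the validation part, and judged once on the testing part, which it has never seen.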

Ranking Prospects after the target
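Ranking prospects can be illustrated as sorting by a model score and contacting only the top of the list (for instance, as many prospects as needed to sell the 390 sofas). A minimal sketch, assuming the model has already produced a purchase probability per prospect; the ids, scores and the name `rank_prospects` are hypothetical.

```python
def rank_prospects(scores, n_contacts):
    """Return the ids of the n_contacts prospects with the highest score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [prospect_id for prospect_id, _ in ranked[:n_contacts]]

# Hypothetical model output: estimated probability that each prospect buys.
scores = {"p1": 0.10, "p2": 0.85, "p3": 0.40, "p4": 0.65}
top = rank_prospects(scores, n_contacts=2)
```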

Confusion matrix: we do make errors. Error rate: rate of misclassification (false / all). Sensitivity (recall): prediction of true occurrence (true positive / positive). Specificity: prediction of non-occurrence (true negative / negative). Precision: the truth in the prediction (true positive / predicted positive). But we use data with known outcome.
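The four measures on this slide follow directly from the cells of the confusion matrix. A short sketch in Python; the counts passed in at the end are made up for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the slide's measures from confusion-matrix counts.

    tp/fn: actual positives predicted positive/negative
    tn/fp: actual negatives predicted negative/positive
    """
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,   # false / all
        "sensitivity": tp / (tp + fn),     # true positive / positive (recall)
        "specificity": tn / (tn + fp),     # true negative / negative
        "precision": tp / (tp + fp),       # true positive / predicted positive
    }

# Hypothetical counts on data with known outcome.
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```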

Overfitting Error rate after iterations

Another model – the Tree

Input-1 Input-2 Input-3 Output-1 Hidden-1 Hidden-2 Neural network

Input-1 Input-2 Input-3 Output-1 Hidden-1 Hidden-2 Neural network – hidden layer

Weights in the neural network
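A network of the shape shown (three inputs, two hidden units, one output) computes its output by passing weighted sums through an activation function. The sketch below assumes a sigmoid activation and omits bias terms; the weight values are arbitrary placeholders, not the weights from the slide.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """Forward pass: inputs -> hidden layer -> single output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Placeholder weights: one row of three weights per hidden unit,
# then one weight per hidden unit for the output.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
output_weights = [1.0, -1.0]
y = forward([1.0, 0.0, 1.0], hidden_weights, output_weights)
```

Training adjusts these weights so that the output matches the known target on the training data.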

Comparing Models

Knowledge in a pragmatic way. Using the model that works! We do not always know why it works, nor for how long (forever is a long time), and we don't know what to look out for. Good exploration leads to theory, hypothesis testing, etc. Demand for a huge dataset in all dimensions.

From analysis of well structured data. We have experience and expertise!

To analysis of unstructured data. Most information is semi-structured text: e-mails, letters, documents, call-center, web pages, web blogs, ...

Structure in text

Text mining. Extracting precise facts from a retrieved document set, or finding associations among disparate facts, leading to the discovery of unexpected or new knowledge. Activities: terminology management; information extraction; information retrieval; data mining phase (find associations among the pieces of extracted information).

How can text mining help? Distill information; extract facts; discover implicit links; generate hypotheses.

Entities and concepts. Extraction of named entities: people, places, organisations, technical terms. Discovery of concepts allows semantic annotation of documents. Improves information by moving beyond index terms, enabling semantic querying. Can build concept networks from text. Clustering and classification of documents. Visualisation of knowledge maps.

Knowledge map

Visualizing links

Popular fields for text mining. Applicable to science, arts and humanities, but most activity is in: the biomedical field (identify protein genes, e.g. search the whole of Medline for "FP3 protein activates/induces enzyme"); government and national security (detection of terrorist activities); finance (sentiment analysis); business (analysis of customer queries, satisfaction, etc.).

Text mining tasks and resources. Documents to mine: texts, web pages, e-mails. Tools: parsers, chunkers, tokenisers, taggers, segmentors, entity classifiers, zoners, annotators, semantic analysers. Resources: annotated corpora, lexicons, ontologies, terminologies, grammars, declarative rule-sets.

Example: part-of-speech tagging. Input: document with word mark-up. Apply tagging tool. Output: additional mark-up of part of speech.

Example: named entity tagging PICTURE HERE

Document clustering. Information retrieval systems based on a user-specified keyword can produce an overwhelming number of results. Want fast and efficient document clustering, to browse and organise: an unsupervised procedure of organising documents into clusters, using hierarchical or partitional (K-means variants) approaches. Terminological analysis based on extracted documents: identify named entities, recognise term variations, and perform query expansion to improve the recall and precision of the documents retrieved.
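To make the partitional approach concrete, here is a toy K-means over term-count vectors in plain Python. It is a sketch under simplifying assumptions: whitespace tokenisation, squared Euclidean distance, and centroids seeded with the first k documents. The four example documents are invented.

```python
from collections import Counter

def vectorise(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [[Counter(d.lower().split())[w] for w in vocab] for d in docs]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Plain K-means; the first k vectors seed the centroids (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented example documents on two topics.
docs = [
    "protein activates enzyme",
    "survey interview transcript",
    "enzyme induces protein",
    "interview transcript archive",
]
labels = kmeans(vectorise(docs), k=2)
```

Real systems would normally weight terms (for example with TF-IDF) and choose starting centroids more carefully; both are left out here to keep the procedure visible.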

Processing steps: submit abstracts; filter by an ontology, applying criteria (date, language, author, no data reported) to include or exclude documents; cluster by ranking; auto-summarise using viewpoints. Use full parsing and machine learning techniques; apply to test annotated corpus; output relevant extracted sentences.

Automatic document summarisation. Document Understanding Conferences (DUC); Message Understanding Conferences (MUC); Text Summarisation Challenge (TSC). Groups undertake specified concrete tasks to generate summaries based on set queries: 1. Input our extracted sentences. 2. Summarise into subsections by topic. 3. Extract salient information. 4. Exclude redundant information. 5. Maintain links from summaries to the source documents.
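Steps 3 to 5 can be illustrated with a very simple extractive summariser. This is a sketch only, not the method used by the DUC/MUC/TSC systems: salience is approximated by average word frequency, redundancy by Jaccard word overlap, and every chosen sentence keeps its source id. The function name, the threshold and the example sentences are assumptions; each sentence is assumed to be non-empty.

```python
from collections import Counter

def summarise(sentences, max_sentences=2, overlap_limit=0.5):
    """sentences: list of (source_id, text). Returns the chosen pairs."""
    tokens = [set(text.lower().split()) for _, text in sentences]
    freq = Counter(w for t in tokens for w in t)
    # Salience: average frequency (across sentences) of a sentence's words.
    order = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in tokens[i]) / len(tokens[i]),
        reverse=True,
    )
    chosen = []
    for i in order:
        # Redundancy: skip a sentence that overlaps heavily with a chosen one.
        redundant = any(
            len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j]) > overlap_limit
            for j in chosen
        )
        if not redundant:
            chosen.append(i)
        if len(chosen) == max_sentences:
            break
    # Returning (source_id, text) keeps the link back to the source document.
    return [sentences[i] for i in chosen]

# Invented extracted sentences with their source documents.
extracted = [
    ("doc1", "text mining extracts facts"),
    ("doc2", "text mining extracts facts quickly"),
    ("doc3", "grid computing distributes processing"),
]
summary = summarise(extracted)
```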

Social science and text mining in the UK. Text mining has not been applied to social science data, neither to published reports nor to raw data. Two realistic social science applications: helping with the new field of systematic review of social science research from published abstracts; helping to process (enrich) shared qualitative data sources for web publishing and sharing. Both are relatively new fields (last 10 years). The UKDA and Edinburgh/Manchester/Essex NLP and text mining connections are a first in the UK/Europe.

Limitations of basic NLP tools. Plethora of tools across institutes. Many tools are individually honed for specific purposes, e.g. biomedical applications. Often tools and output from tools are non-interoperable: hard to bolt components together. NLP tools are ugly: unix/linux command-line programs that communicate via pipes. Often useful to draw on a range of existing tools for different processing purposes.

Text mining services. Centre for Text Mining in the UK: develop tools and demonstrators; processing service with packaging of results; best practice, user support and training; access to ontology libraries; access to lexical resources (dictionaries, glossaries and taxonomies); data access, including annotated corpora; grid-based flexible composition of tools, resources and data (portal and workflows).

The power of the GRID. At present, social science problems have typically not required huge computational power. Computational power is needed for undertaking large-scale data and text mining: searching for a conditional string across millions of records can take hours. A data grid is useful for exposing multiple data sources in a systematic way, using single sign-on procedures.

Mining and the GRID. Parallel power: distribute processes over lots of machines; use parallel algorithms to speed up processing tasks. Access to distributed data and models: multiple pre-processed textual data, distributed annotation of text, models with provenance metadata. Processing pipeline: distributed tools/components are hosted at different sites. But what about curation, exposure and systematic description of data sources?

Challenges for mining. Maximise the interoperability of processing resources; maximise shared data and metadata resources in a distributed fashion; enable simplified yet safe sharing and respect for ownership; innovative methods of visualisation; hide any nasty behind-the-scenes business from the average user (processing programs, authentication middleware, etc.); new Web Services, registries, resource brokers and protocols; juggling data dimensions, from atomic data to aggregations.

? Thanks Louise Corti & Karsten Boye Rasmussen