Published by Benedict Grant. Modified over 9 years ago.
1
Text mining michel.bruley@teradata.com
Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
2
Information context
- A large amount of information is available in textual form in databases and online sources.
- In this context, manual analysis and effective extraction of useful information are not possible.
- It is therefore relevant to provide automatic tools for analyzing large textual collections.
3
Text mining definition
The objective of text mining is to exploit information contained in textual documents in various ways, including discovery of patterns and trends in data, associations among entities, predictive rules, etc.
The results can be important both for:
- the analysis of the collection, and
- providing intelligent navigation and browsing methods.
4
Text mining pipeline
From unstructured text (implicit knowledge) to structured content (explicit knowledge):
- Information retrieval
- Information extraction
- Knowledge discovery
- Semantic metadata
- Semantic search / data mining
5
Text mining process
An iterative and interactive process:
- Text preprocessing: syntactic/semantic text analysis
- Features generation: bag of words
- Features selection: simple counting, statistics
- Text/data mining: classification (supervised learning), clustering (unsupervised learning)
- Analyzing results: mapping/visualization, result interpretation
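The features generation and features selection steps of the text mining process can be illustrated with a small sketch. It is not taken from the presentation: the function names, the regular-expression tokenizer, and the document-frequency threshold `min_doc_freq` are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Features generation: a sorted vocabulary and one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


def select_features(vocabulary, vectors, min_doc_freq=2):
    """Features selection by simple counting: keep terms that occur in at
    least min_doc_freq documents (the threshold is an assumed heuristic)."""
    keep = [
        j for j in range(len(vocabulary))
        if sum(1 for vector in vectors if vector[j] > 0) >= min_doc_freq
    ]
    kept_vocabulary = [vocabulary[j] for j in keep]
    kept_vectors = [[vector[j] for j in keep] for vector in vectors]
    return kept_vocabulary, kept_vectors


docs = ["Text mining finds patterns", "Data mining finds rules"]
vocab, vecs = bag_of_words(docs)
small_vocab, small_vecs = select_features(vocab, vecs, min_doc_freq=2)
```

The resulting vectors are the kind of input a classifier or clustering algorithm would consume in the next step of the process.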
6
Text mining actors
Publishers
- Enriched content
- Annotation tools
- Tools for authors
- New applications based on annotation layers
- Richer cross linking based on content…
Analysts
- Empowers them
- Annotating research output
- Hypothesis generation
- Summarisation of findings
- Focused semantic search…
Libraries
- Linking between institutional repositories
- Access to richer metadata
- Aggregation
- Aids to subject analysis/classification…
7
Challenges in text mining
- The data collection is "free text" and is not well organized (semi-structured or unstructured).
- There is no uniform access over all sources; each source has separate storage and algebra. Examples: databases, applications, web.
- A quintuple heterogeneity: semantic, linguistic, structure, format, size of unit information.
- Learning techniques for processing text typically need annotated training data.
- XML as the common model allows:
  - manipulating data with standards;
  - mining that becomes more like data mining.
- RDF is emerging as a complementary model.
- The more structure you can explore, the better you can do mining.
8
Data source administration
Sources: Intranet, Internet, on-line databank, information provider, file system, databases, EDMS
Steps: web crawling, format filter, XML normalisation
Normalised fields:
-subject
-author
-text corpora
-keywords
Input data system: this part of the system is related to the collection of the data.
-Getting data from the internet with a crawler
-Getting data from online vendors
-Getting data from the internal data banks
Regarding the input format (physical and logical), data are physically reformatted into HTML and then loaded into an XML format.
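The XML normalisation step can be pictured with a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`document`, `subject`, `author`, `text`, `keywords`, `keyword`) and the sample values are assumptions for illustration; the presentation does not give the actual schema.

```python
import xml.etree.ElementTree as ET


def normalise(subject, author, text, keywords):
    """Wrap one collected document in a common XML structure.

    The element names are hypothetical; they only mirror the fields
    listed on the slide (subject, author, text, keywords).
    """
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = text
    keywords_element = ET.SubElement(doc, "keywords")
    for keyword in keywords:
        ET.SubElement(keywords_element, "keyword").text = keyword
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(doc, encoding="unicode")


xml_record = normalise(
    subject="Text mining",
    author="A. Author",
    text="Crawled page content & more",
    keywords=["mining", "xml"],
)
parsed = ET.fromstring(xml_record)
```

Once every source is mapped to the same structure, later steps can read fields by name regardless of where the document came from.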
9
Text mining tasks
Text analysis tools
- Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
- Categorization
- Summarization
- Clustering: hierarchical clustering, binary relational clustering
Web searching
- Text search engine
- NetQuestion Solution
- Web crawler
1. Feature extraction tools: they recognize significant vocabulary items in documents and measure their importance to the document content.
2. Clustering tools: clustering is used to segment a document collection into subsets, called clusters.
3. Summarization tool: summarization is the process of condensing a source text into a shorter version while preserving its information content.
4. Categorization tool: categorization is used to assign objects to predefined categories, or classes from a taxonomy.
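One common way to measure the importance of a vocabulary item to a document is TF-IDF weighting. The sketch below uses it as a stand-in; the presentation does not say which measure the feature extraction tools actually apply, and the function name and sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document.

    score = (term count / document length) * log(N / document frequency),
    a plain TF-IDF variant chosen here for illustration.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [["text", "mining", "text"], ["data", "mining"]]
weights = tf_idf(docs)
```

A term that appears in every document, such as "mining" here, gets a score of zero, while a term concentrated in one document scores higher.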
10
Information extraction
Related functions: link analysis, query log analysis, metadata extraction, keyword ranking, intelligent match, duplicate elimination
- Extract domain-specific information from natural language text
- Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
  - Constructed by hand
  - Automatically learned from hand-annotated training data
- Need a semantic lexicon (dictionary of words with semantic category labels)
  - Typically constructed by hand
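A hand-built dictionary of extraction patterns can be sketched with regular expressions. The two patterns below follow the slide's examples; the category labels (`destination`, `organization`), the capitalized-word heuristic for the slot `<x>`, and the sample sentence are assumptions for illustration only.

```python
import re

# A run of capitalized words stands in for the slot <x> (a rough heuristic).
SLOT = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"

# Hand-constructed dictionary: hypothetical label -> extraction pattern.
PATTERNS = {
    "destination": re.compile(r"traveled to " + SLOT),
    "organization": re.compile(r"presidents? of " + SLOT),
}


def extract(text):
    """Return (label, filler) pairs for every pattern match in the text."""
    results = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((label, match.group(1)))
    return results


sample = "The minister traveled to New York. Former presidents of France met there."
facts = extract(sample)
```

Real systems usually learn such patterns from annotated data or combine them with a semantic lexicon; this sketch covers only the hand-constructed case.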
11
Document collections treatment
Categorization Clustering
12
Text Mining example: Obama vs. McCain
13
Aster Data position for Text Analysis
- Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
- Pre-Processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
- Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
- Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
Legend: Aster Data Fit, Third-Party Tools Fit
Aster Data Value:
- Massive scalability of text storage and processing
- Functions for text processing
- Flexibility to develop diverse custom analytics and incorporate third-party libraries
14
Aster Data Value for Text Analytics
- Ability to store and process massive volumes of text data
  - Massively parallel data stores and massively parallel analytics engine
  - SQL-MapReduce framework enables in-database processing for specialized text analytics tools
- Tools and extensibility for processing diverse text data
  - SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
  - Pre-built functions for text processing
- Flexible platform for building and processing diverse analytics
  - SQL-MapReduce framework enables creation of flexible, reusable analytics
  - Embedded MapReduce processing engine for high-performance analytics
15
Aster Data Capabilities for Text Data
Pre-built SQL-MapReduce functions for text processing:
- Data transformation utilities
  - Pack: compress multi-column data into a single column
  - Unpack: extract nested data for further analysis
- Web log analysis
  - Sessionization: identify unique browsing sessions in clickstream data
- Text analysis
  - Text parser: general tool for tokenizing, stemming, and counting text data
  - nGram: split text into component parts (words & phrases)
  - Levenshtein distance: compute “distance” between words
Diagram labels: Custom and Packaged Analytics, Aster Data nCluster, App, Aster Data Analytic Foundation, SQL, SQL-MapReduce, Data
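To make two of the text analysis functions concrete, here is a plain Python sketch of n-gram splitting and of the edit distance between words. It illustrates the general techniques only; it is not Aster Data's SQL-MapReduce implementation, and the function names and signatures are assumptions.

```python
def ngrams(tokens, n):
    """Split a token list into its consecutive n-token parts."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


bigrams = ngrams(["text", "mining", "is", "useful"], 2)
distance = levenshtein("kitten", "sitting")
```

The distance function keeps only two rows of the dynamic-programming table, so memory grows with the length of the second word rather than with the product of both lengths.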