Analyzing Text with SQL Server 2014, R, and Azure ML. Dejan Sarka.

Introduction
Dejan Sarka
  – Data Scientist
  – MCT, SQL Server MVP
  – 30 years of data modeling, data mining and data quality
13 books
~10 courses

Agenda
Text Mining
Using R
Azure ML NER
Full-Text Search
Statistical Semantic Search
Conclusion

Introduction to Text Mining
Text mining = analysis of text
Text mining cannot be done with a single tool in SQL Server
Use SSIS for preparation
Use SSAS Data Mining for in-depth analysis
  – Can use SSAS BISM for manual analysis
  – Can use SSRS for presentation

Text Mining in SSIS
SSIS integrates text mining into the data flow
  – This advanced feature can source data just like any other component, and the results can be handled just like the output from any other component
  – Routed, sorted, aggregated, cleansed, transformed and loaded...
  – Source can be a database table with a comments field; it could be XML (RSS for example) or any other text source
The two components to explore are Term Extraction and Term Lookup

Term Extraction (1)
Term Extraction enables retrieving the key terms from a Unicode string or text column (DT_WSTR or DT_NTEXT)
  – It uses its own dictionary and linguistic information about English
  – Can be used with other languages, but results are worse
It can extract nouns only, noun phrases only, or both nouns and noun phrases
  – Articles and pronouns are not extracted
It breaks text into sentences, and sentences into words
It normalizes capitalization of words
It also stems nouns to extract the singular form

Term Extraction (2)
The Term Extraction transformation scores each term
  – It scores terms based on a number of factors, including English grammar and syntax
  – The output includes only two columns: the extracted term and its score
Score can be TF (term frequency) or TFIDF (term frequency / inverse document frequency)
  – TFIDF = TF * LN(number of documents / number of documents with the term)
  – TFIDF lowers the score for terms that appear in many documents
  – Emphasizes terms that are frequent in a small number of documents
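
The TFIDF score can be illustrated with a small Python sketch. It assumes the common definition, term frequency multiplied by the natural log of (number of documents / number of documents containing the term), and uses naive whitespace tokenization in place of the noun and noun-phrase extraction that SSIS performs. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term as TF * ln(N / number of docs containing the term).

    TF is the total number of occurrences across all documents.
    Whitespace tokenization is an assumption made for this sketch only.
    """
    n_docs = len(docs)
    tf = Counter()   # total occurrences of each term
    df = Counter()   # number of documents containing each term
    for doc in docs:
        terms = doc.lower().split()
        tf.update(terms)
        df.update(set(terms))
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}

docs = ["index index query", "index backup", "index restore restore"]
scores = tfidf_scores(docs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

A term that occurs in every document ("index" here) scores 0, while a term concentrated in one document ("restore") scores highest, which is the effect the slide describes.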

Term Extraction (3)
Can use exclusion terms
  – If you analyze text from a specific area only, some terms (e.g. “SQL Server” in texts about SQL Server) can become noise
  – The transformation skips extraction of exclusion terms
  – Must be stored in a SQL Server or Access table
It is a blocking transformation
  – Needs to consume the complete upstream data before releasing any row downstream
You should test it with different options to get the result that suits your needs

Term Lookup (1)
The Term Lookup transformation matches terms extracted from text in a transformation input column with terms in a reference table
  – Counts how many times a term appears in a document
  – Before performing the lookup, it extracts words using the same method as the Term Extraction transformation
  – If a document contains terms that overlap in the reference set, it returns only one lookup result
  – E.g., if the term “Microsoft Windows Vista SP1” is in the document, and “Microsoft Windows” and “Windows Vista SP1” are both in the reference set, only “Microsoft Windows” is returned
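
A rough Python sketch of the counting step follows. It is not the SSIS algorithm: it normalizes text with a simple regular expression instead of the linguistic extraction, and it deliberately ignores the overlap rule, so a shorter reference term is also counted inside a longer one. All names and sample data are hypothetical.

```python
import re

def normalize(text):
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))

def term_lookup(documents, reference_terms):
    """Return (doc_id, term, frequency) rows for reference terms found in each document."""
    rows = []
    for doc_id, text in documents.items():
        haystack = normalize(text)
        for term in reference_terms:
            # Match the normalized term on word boundaries only.
            pattern = r"\b" + re.escape(normalize(term)) + r"\b"
            count = len(re.findall(pattern, haystack))
            if count:
                rows.append((doc_id, term, count))
    return rows

documents = {
    1: "Full-text search uses a full-text index.",
    2: "The index is rebuilt nightly.",
}
reference_terms = ["full-text index", "index"]
lookup_rows = term_lookup(documents, reference_terms)
for row in lookup_rows:
    print(row)
```

The output has one row per document and matched term, which is the shape that later analysis (clustering, association rules, classification) consumes.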

Term Lookup (2)
The reference set is a set of terms in a lookup table
  – Usually the result of Term Extraction
  – Can edit it manually
  – Must be stored in a SQL Server or Access table
It is a semi-blocking transformation
  – Holds up records in the Data Flow for a period of time before passing memory buffers downstream

Further Analysis
Mine Term Lookup results
  – Clustering algorithm groups documents in clusters based on similarity of term occurrences
  – Association Rules detects cross-correlation between key words and phrases
  – Classification algorithms, like Decision Trees, can use key words and phrases to predict the class of a document
UDM cubes and reports can present Term Lookup results
You can also present Term Extraction results directly

Problems
The only supported language for Term Extraction and Term Lookup is English
  – They have no knowledge of syntax and grammar in other languages
Neither transformation supports custom delimiters
  – Makes them even less useful for non-English languages

Analyzing Text in R (1)
Define “corpus”: the set of documents to analyze
Read the documents from several sources in several supported formats
  – R data frame, directory, URI, R vector, XML
  – CSV, DOC, PDF, XML
Use different transformations to prepare the text for analysis
  – Change special characters, lowercase, remove punctuation, remove stopwords, strip space
  – Use stemming

Analyzing Text in R (2)
Create document term matrix
Calculate term frequencies
  – Remove sparse terms
  – Find most frequent terms
Find associations
Use different plots
  – Term frequency
  – Word clouds
  – Term length frequency
  – Letter frequency
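
The slides do this work in R. As a language-neutral illustration of the same steps, here is a minimal Python sketch that prepares text (lowercasing, removing punctuation and stopwords), builds a document term matrix, removes sparse terms and sums term frequencies. Stemming is omitted, and the stopword list, function names and corpus are assumptions made for the example.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "in", "to"}  # tiny illustrative list

def prepare(text):
    """Lowercase, drop punctuation and digits, and remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def document_term_matrix(corpus):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    counts = [Counter(prepare(doc)) for doc in corpus]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

def remove_sparse_terms(vocabulary, matrix, min_docs):
    """Keep only terms that occur in at least min_docs documents."""
    keep = [j for j in range(len(vocabulary))
            if sum(1 for row in matrix if row[j] > 0) >= min_docs]
    return [vocabulary[j] for j in keep], [[row[j] for j in keep] for row in matrix]

def term_frequencies(vocabulary, matrix):
    """Total occurrences of each term across the corpus."""
    return {term: sum(row[j] for row in matrix) for j, term in enumerate(vocabulary)}

corpus = [
    "The index is fragmented.",
    "Rebuild the index and the statistics.",
    "Statistics of the index.",
]
vocabulary, matrix = document_term_matrix(corpus)
dense_vocab, dense_matrix = remove_sparse_terms(vocabulary, matrix, min_docs=2)
frequencies = term_frequencies(dense_vocab, dense_matrix)
print(vocabulary)
print(matrix)
print(frequencies)
```

The frequency dictionary is the input a term frequency plot or word cloud would use.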

Analyzing Text in R (3)

Azure ML NER
Named Entity Recognition
  – Identifies proper names
  – Classifies names to categories
Categories can be universal or local
  – Person, location,…
  – DBMS, programming language,…
Azure ML NER accepts two inputs
  – Story: texts to analyze
  – Custom Resources: local linguistic resources

With FTS/SSS, Search for:
Simple terms, i.e. one or more specific words or phrases
Prefix terms, terms the words or phrases begin with
Generation terms, meaning inflectional forms of words
Proximity terms, or words or phrases close to another word or phrase
Thesaurus terms, or synonyms of a word
Weighted terms, words or phrases using values with your custom weight
Statistical semantic search, or key phrases in a document
Similar documents, where similarity is defined by semantic key phrases

FTS/SSS Components (1)
Install FTS/SSS with Setup
Install document filters
  – Download Office 2007 / 2010 filters and load them
  – Columns of data type VARBINARY, VARBINARY(MAX), IMAGE, or XML require an additional type column in which you store the file extension
Word breakers and stemmers perform language-specific linguistic analysis on all full-text data
  – Can use English if a language is not supported

FTS/SSS Components (2)
Can prevent indexing of noise words by creating stoplists of stopwords
FTS finds synonyms in thesaurus files
  – Each language has an associated XML thesaurus file; path: SQL_Server_install_path\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\FTDATA\
  – You can manually edit each thesaurus file and load it
Can search on document properties
  – Searchable properties depend on the document filter
  – Can create a search property list

FTS/SSS Catalogs and Indexes
FT indexes are stored in FT catalogs
  – A full-text catalog is a virtual object, a container for full-text indexes
  – As a virtual object, it does not belong to any filegroup
An FT index is a physical object
For semantic search, install the Semantic Language Statistics Database
Can test FTS parsing and stemming with sys.dm_fts_parser

The CONTAINS Predicate (1)
Search for exact or fuzzy matches
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'SearchWord1') – simple term
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'SearchWord1 OR SearchWord2') – simple term with a logical operator
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, '"SearchWord1 SearchWord2"') – phrase term
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, '"SearchWord1*"') – prefix term

The CONTAINS Predicate (2)
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'NEAR(SearchWord1, SearchWord2)') – simple proximity term
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'NEAR((SearchWord1, SearchWord2), distance)') – proximity term with distance
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'NEAR((SearchWord1, SearchWord2), distance, flag)') – proximity term with distance and order of words (flag = TRUE | FALSE)

The CONTAINS Predicate (3)
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'FORMSOF(INFLECTIONAL, SearchWord1)') – generation term
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'FORMSOF(THESAURUS, SearchWord1)') – generation term with synonyms
  – SELECT…FROM…WHERE CONTAINS(FTcolumn, 'ISABOUT(SearchWord1 weight(w1), SearchWord2 weight(w2))') – weighted term (weights only affect ranking, so this is useful with CONTAINSTABLE rather than CONTAINS)
  – SELECT…FROM…WHERE CONTAINS(PROPERTY(FTcolumn, 'PropertyName'), 'SearchWord1') – property search
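
The search conditions listed on the CONTAINS slides are plain strings, so a client application can compose them before passing them to the query. The Python helpers below are a hypothetical sketch of that composition; none of them is part of SQL Server or of any driver. They reject terms containing quote characters instead of guessing at an escaping rule.

```python
def _check(term):
    """Refuse terms with quote characters; escaping is out of scope for this sketch."""
    if '"' in term or "'" in term:
        raise ValueError("quote characters are not handled in this sketch")
    return term

def phrase(term):
    """Phrase term: "SearchWord1 SearchWord2"."""
    return f'"{_check(term)}"'

def prefix(term):
    """Prefix term: "SearchWord1*"."""
    return f'"{_check(term)}*"'

def near(first, second, distance=None, ordered=None):
    """Proximity term, optionally with a maximum distance and a word-order flag."""
    pair = f"{_check(first)}, {_check(second)}"
    if distance is None:
        return f"NEAR({pair})"
    if ordered is None:
        return f"NEAR(({pair}), {int(distance)})"
    return f"NEAR(({pair}), {int(distance)}, {'TRUE' if ordered else 'FALSE'})"

def forms_of(term, thesaurus=False):
    """Generation term: inflectional forms, or synonyms when thesaurus=True."""
    kind = "THESAURUS" if thesaurus else "INFLECTIONAL"
    return f"FORMSOF({kind}, {_check(term)})"

def is_about(weights):
    """Weighted term; each weight must lie between 0.0 and 1.0."""
    parts = []
    for term, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0.0 and 1.0")
        parts.append(f"{_check(term)} weight({weight})")
    return "ISABOUT(" + ", ".join(parts) + ")"

print(prefix("data"))
print(near("data", "mining", 5, True))
print(forms_of("mine"))
print(is_about({"data": 0.8, "mining": 0.2}))
```

The resulting string would be supplied as the second argument of CONTAINS or CONTAINSTABLE, preferably as a query parameter rather than by concatenating it into the SQL text.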

The FREETEXT Predicate
The FREETEXT predicate is less specific and thus returns more rows than the CONTAINS predicate
  – SELECT…FROM…WHERE FREETEXT(FTcolumn, 'SearchWord1 SearchWord2') – you are searching for rows where the FTcolumn includes any of the inflectional forms and any of the defined synonyms of the words SearchWord1 and SearchWord2

FTS Functions
The CONTAINSTABLE and FREETEXTTABLE functions return two columns: KEY and RANK
  – KEY: the unique full-text key of the matching row
  – RANK: a value between 0 and 1000 telling you how well a row matches your search criteria
The number is always relative to a query
The calculation takes into account term frequency, number of words in a document, proximity terms, weight, number of indexed rows, …
The calculation differs between CONTAINSTABLE and FREETEXTTABLE, as the latter does not support the majority of the parameters

SSS Functions
SEMANTICKEYPHRASETABLE ( table, { column | (column_list) | * } [, source_key ] ) – returns a table with key phrases associated with the full-text indexed column from the column_list
SEMANTICSIMILARITYDETAILSTABLE ( table, source_column, source_key, matched_column, matched_key ) – returns a table with key phrases that are common across two documents
SEMANTICSIMILARITYTABLE ( table, { column | (column_list) | * }, source_key ) – returns a table with documents scored by semantic similarity to the searched document specified with the source_key parameter

Problems
FTS supports > 50 languages
  – Including Slovak and Slovenian
SSS supports 15 languages
  – Excluding Slovak and Slovenian
An SSS “term” consists of a single word
So how to analyze texts in Slovak or Slovenian?
  a) Not with SQL Server tools
  b) Custom application with sys.dm_fts_parser
  c) Use a translator and then SQL Server tools

So What Does “kuraci” Mean?

Really?

Q & A
Come to SQL Saturday Ljubljana, December 12th, 2015!
Thank you!