Analyzing Text with SQL Server 2014, R, AND Azure ML Dejan Sarka.

Analyzing Text with SQL Server 2014, R, AND Azure ML Dejan Sarka

Introduction Dejan Sarka –dsarka@solidq.com, dsarka@siol.net, @DejanSarkadsarka@solidq.com dsarka@siol.net –Data Scientist –MCT, SQL Server MVP –30 years of data modeling, data mining and data quality 13 books ~10 courses 2

Text Mining Using R Azure ML NER Full-Text Search Statistical Semantic Search Conclusion Agenda

Text mining = analysis of text Text mining cannot be done with a single tool in SQL Server Use SSIS for preparation Use SSAS Data Mining for in-depth analysis –Can use SSAS BISM for manual analysis –Can use SSRS for presentation Introduction to Text Mining

SSIS integrates text mining into the data flow –This advanced feature can source data just like any other component, and the results can be handled just like the output from any other component –Routed, sorted, aggregated, cleansed, transformed and loaded... –Source can be a database table with a comments field; it could be XML (RSS for example) or any other text source The two components to explore are Term Extraction and Term Lookup Text Mining in SSIS

Term Extraction enables retrieving the key terms from a Unicode string or text column (DT_WSTR or DT_NTEXT) –It uses its own dictionary and linguistic information about English –Can be used with other languages, but results are worse It can extract nouns only, noun phrases only, or both nouns and noun phrases –Articles and pronouns are not extracted It breaks text into sentences, and sentences into words It normalizes capitalization of words It also stems nouns to extract the singular form Term Extraction (1)

The Term Extraction transformation scores each term –It scores terms based on a number of factors, including its English grammar and syntax –The output includes only two columns - the extracted terms and the score Score can be TF (term frequency) or TFIDF (term frequency / inverse document frequency) –TFIDF = TF * LN(n of docs with term / TF) –TFIDF lowers the score for terms that appear in many documents –Emphasizes terms that are frequent in lower number of documents Term Extraction (2)

Can use exclusion terms –If you analyze text from specific area only, some terms (e.g. “SQL Server” in texts about SQL Server) can become noisy –The transformation skips extraction of exclusion terms –Must be stored in a SQL Server or Access table It is a blocking transformation –Need to consume complete upstream data before releasing any row downstream You should test it with different options to get the result that suits your needs Term Extraction (3)

The Term Lookup transformation matches terms extracted from text in a transformation input column with terms in a reference table –Counts how many times a term appears in a document –Before performing lookup, it extracts words using the same method as the Term Extraction transformation –If a document contains terms that overlap in the reference set, it returns only one lookup result –E.g., term “Microsoft Windows Vista SP1” is in document, “Microsoft Windows” and “Windows Vista SP1” both in reference set, only “Microsoft Windows” is returned Term Lookup (1)

Reference set is a set of terms in a lookup table –Usually result of Term Extraction –Can edit it manually –Must be stored in a SQL Server or Access table It is a semi-blocking transformation –Holds up records in the Data Flow for a period of time before passes memory buffers downstream Term Lookup (2)

Mine Term Lookup results –Clustering algorithm groups documents in clusters based on similarity of term occurrences –Association Rules detects cross-correlation between key words and phrases –Classification algorithms, like Decision Trees, can use key words and phrases to predict the class of a document UDM cubes and reports can present Term Lookup results You can also present Term Extraction results directly Further Analysis

The only supported language for Term Extraction and Term Lookup is English –They have no clue about syntax and grammar in other languages Both transformations do not support custom delimiters –Makes them even less useful for non-English languages Problems

Analyzing Text in R (1) Define “corpus” – set of documents to analyze Read the documents from several sources in several supported formats –R data frame, directory, URI, R vector, XML –CSV, DOC, PDF, XML Use different transformations to prepare the text for analysis –Change special characters, lowercase, remove punctuation, remove stopwords, strip space –Use stemming 13

Analyzing Text in R (2) Create document term matrix Calculate term frequencies –Remove sparse terms –Find most frequent terms Find associations Use different plots –Term frequency –Word clouds –Term length frequency –Letter frequency 14

Analyzing Text in R (3) 15

Azure ML NER Named Entity Recognition –Identifies proper names –Classifies names to categories Categories can be universal or local –Person, location,… –DBMS, programming language,… Azure ML NER accepts two inputs –Story – texts to analyze –Custom Resources – local linguistic resources 16

Simple terms, i.e. one or more specific words or phrases Prefix terms, terms the words or phrases begin with Generation terms, meaning inflectional forms of words Proximity terms, or words or phrases close to another word or phrase Thesaurus terms, or synonyms of a word Weighted terms, words or phrases using values with your custom weight Statistical semantic search, or key phrases in a document Similar documents, where similarity is defined by semantic key phrases With FTS/SSS, Search for: 17

Install FTS/SS with Setup Install document filters –Download Office 2007 / 2010 filters an load them –Columns of data type VARBINARY, VARBINARY(MAX), IMAGE, or XML require additional type column in which you store the file extension Word breakers and stemmers perform language- specific linguistic analysis on all full-text data –Can use English if a language is not supported FTS/SSS Components (1) 18

Can prevent indexing noise words by creating stoplists of stopwords FTS finds synonyms in thesaurus files –Each language has an associated XML thesaurus file - path SQL_Server_install_path\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\FTDATA\ –You can manually edit each thesaurus file and load it Can search on document properties –Searchable properties depend on the document filter –Can create search property list FTS/SS Components (2) 19

FT indexes are stored in FT catalogs –A full-text catalog is a virtual object, a container for full-text indexes –As a virtual object, it does not belong to any filegroup A FT Index is a physical object For semantic search, install the Semantic Language Statistics Database Can test FTS parsing and stemming with sys.dm_fts_parser FTS/SS Catalogs and Indexes 20

Search for exact or fuzzy matches –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘SearchWord1’) – simple term –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘SearchWord1 OR SearchWord2’) – simple term with a logical operator –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘”SearchWord1 SearchWord2”’) – phrase term –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘”SearchWord1*”’) – prefix term The CONTAINS Predicate (1) 21

–SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘NEAR(SearchWord1, SearchWord2)’) – simple proximity term –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘NEAR((SearchWord1, SearchWord2), distance)’) – proximity term with distance –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘NEAR((SearchWord1, SearchWord2), distance, flag)’) – proximity term with distance and order of words (flag = True | False) The CONTAINS Predicate (2) 22

–SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘FORMSOF(INFLECTIONAL, SearchWord1)’) - generation term –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘FORMSOF(THESAURUS, SearchWord1)’) - generation term with synonyms –SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘ISABOUT(SearchWord1 weight(w1), SearchWord2 weight(w2))’) - weighted term (not useful for CONTAINS) –SELECT…FROM…WHERE CONTAINS(PROPERTY(Ftcolumn, ‘PropertyName’), ‘SearchWord1’) - property search The CONTAINS Predicate (3) 23

The FREETEXT predicate is less specific and thus return more rows than the CONTAINS predicate –SELECT…FROM…WHERE FREETEXT(FTcolumn, ‘SearchWord1 SearchWord2’) - you are searching for rows where the FTcolumn includes any of the inflectional forms and any of the defined synonyms of the words SearchWord1 and SearchWord2 The FREETEXT Predicate 24

The CONTAINSTABLE and FREETEXTTABLE functions return two columns: KEY and RANK –The KEY column is the unique key –RANK - a value between 0 and 1000 telling you how well a row matches your search criteria The number is always relative to a query The calculation takes into account term frequency, number of words in a document, proximity terms, weight, number of indexed rows, … Different calculation for the CONTAINSTABLE and for the FREETEXTTABLE, as the later does not support majority of the parameters FTS Functions 25

SEMANTICKEYPHRASETABLE ( table, { column | (column_list) | * } [, source_key ] ) - returns a table with key phrases associated with the full- text indexed column from the column_list SEMANTICSIMILARITYDETAILSTABLE ( table, source_column, source_key, matched_column, matched_key ) - returns a table with key phrases that are common across two documents SEMANTICSIMILARITYTABLE ( table, { column | (column_list) | * }, source_key ) - returns a table with documents scored by semantic similarity to the searched document specified with the source_key parameter SSS Functions 26

FTS supports > 50 languages –Including Slovak and Slovenian SS supports 15 languages –Excluding Slovak and Slovenian A SS “term” consists of a single word So how to analyze texts in Slovak or Slovenian? a)Not with SQL Server tools b)Custom application with sys.dm_fts_parser c)Use a translator and then SQL Server tools Problems

So What Does “kuraci” Mean?

Really?

Q & A Come to SQL Saturday Ljubljana, December 12 th, 2015! Thank you! 30

Analyzing Text with SQL Server 2014, R, AND Azure ML Dejan Sarka.

Similar presentations

Presentation on theme: "Analyzing Text with SQL Server 2014, R, AND Azure ML Dejan Sarka."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Analyzing Text with SQL Server 2014, R, AND Azure ML Dejan Sarka.

Similar presentations

Presentation on theme: "Analyzing Text with SQL Server 2014, R, AND Azure ML Dejan Sarka."— Presentation transcript:

Similar presentations

About project

Feedback