Presentation on theme: "Text Mining SQL Saturday"— Presentation transcript:

1 Text Mining SQL Saturday
Lynn Ballard @widowpage

2 2 Main Takeaways
Introduction to basic text mining techniques
Information to help you use SQL Saturdays better

3 Agenda
Obtaining the Data
SSIS to generate most commonly used words
R and Python for word clouds, sentiment analysis and topic modeling
SSIS for keywords/key phrases
Questions
Speech Review

4 Obtaining the Data
PASS SQL Saturday’s XML feed is available at
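As a rough sketch of pulling such a feed into R before loading it into SQL Server, the xml2 package can flatten the XML into a data frame. The URL and element names below are placeholders, not the real feed schema:
library(xml2)

# Placeholder feed URL and element names -- substitute the real SQL Saturday feed
feed     <- read_xml("http://example.com/sqlsaturday/eventfeed.xml")
sessions <- xml_find_all(feed, ".//session")

speeches <- data.frame(
  Title       = xml_text(xml_find_first(sessions, "./title")),
  Description = xml_text(xml_find_first(sessions, "./description")),
  stringsAsFactors = FALSE
)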

5 Resulting Table Structures
Events: Id, Name, City, State, StartDate, Country
Speakers: EventId, ImportId, Name, Description, Twitter, Gender
Speeches: EventId, ImportId, Track, Title, Description

6 Background on SQL Saturday
Columns: Events, USA, Outside, Largest Outside US, Number of Speeches, Largest Event (# of Speakers)
2007: 1
2008: 8
2009: 16
2010: 32
2011: 44, 37, 7, Canada/South Africa (2)
2012: 79, 47, Australia (5)
2013: 85, 43, 42, UK/Italy (3)
2014: 90, 49, 41, UK/Brazil (3)
2015: 117, 51, 54, UK (5), 2,887, Alpharetta (83)
2016: 104, 50, Brazil (4), 3,167, Alpharetta (105)
2017: 113, 63, Brazil (6), 3,310, Chattanooga (83)
2018: 48, 65, Brazil (7), 3,338, Atlanta (68)
2019: 89, 36, 53, 2,154, Atlanta (95)
As of June 11

7 Largest Events 2016 2017 2018 2019 Atlanta (105) Chattanooga (83)
Phoenix (76) Nashville (70) Baton Rouge (66) Baton Rouge (70) Nashville (72) NYC (68) Nashville (61) Dallas (66) Huntington Beach (70) Atlanta (67) Dallas\Indianapolis (58) LA (64) Baton Rouge (65) Indianapolis (61) Chicago\Portland (57) Orange County (63) South Florida (57) Phoenix (60) Denver (56) Chicago (60) Chicago (56) Chicago/Baton Rouge (58) Houston\OC (55) Nashville\S Florida (57) Houston (55) South Florida (56) NYC (53) Phoenix (55) Dallas (53) LA (51) Philly (52) Cleveland (48) Indianapolis/Tampa (52) Charlotte (50) Tampa (49) Tampa/Madison (47) Average was 43 Average was 39 Average was 40 Average is 37

8 Closest SQL Saturday Events
Event: Driving Time, Date, Speaker Count
SQLSaturday #698 - Nashville 2018: 8.5, 1/13/18, 61
SQLSaturday #708 - Cleveland 2018: 8, 2/3/18, 36
SQLSaturday #719 - Chicago 2018: 3, 3/17/18, 57
SQLSaturday #701 - Cincinnati 2018: 7, 37
SQLSaturday #724 - Madison 2018: 4/7/18, 45
SQLSaturday #752 - Iowa City 2018: 6/23/18, 25
SQLSaturday #729 - Louisville 2018: 7/21/18, 41
SQLSaturday #736 - Columbus 2018: 7/28/18, 44
SQLSaturday #745 - Indianapolis 2018: 8/11/18, 58
SQLSaturday #768 - Wausau 2018: 4, 9/8/18, 19
SQLSaturday #796 - Minnesota 2018: 6, 10/6/18, 46
SQLSaturday #815 - Nashville 2019: 1/12/19, 56
SQLSaturday #821 - Cleveland 2019: 2/2/19, 48
SQLSaturday #825 - Chicago 2019: 3/23/19, 55
SQLSaturday #827 - Cincinnati 2019: 3/30/19
SQLSaturday #842 - Madison 2019: 4/6/19
SQLSaturday #861 - Columbus 2019: 6/8/19
SQLSaturday #882 - Iowa City 2018: 6/22/19, 24

9 Categories
Enterprise Database Administration Deployment: 2705
BI Platform Architecture, Development Administration / BI / BI Information Delivery: 2657
Application Database Development: 1805
Professional Development: 683
DBA: 666
Cloud Application Development Deployment: 608
Analytics and Visualization: 587
Other: 402
Strategy and Architecture: 385
Advanced Analysis Techniques: 332

10 Category Comparison
First Year:
Enterprise Database Administration Deployments: 333
Application Database Development: 199
Other: 116
BI Platform Architecture, Development Administration: 113
Professional Development: 110
DBA: 79
BI Information Delivery: 70
Business Intelligence: 51
Strategy and Architecture: 42
Track 3: 35
Cloud Application Development Deployment: 34
Analytics and Visualization: 33
Advanced Analysis Techniques: 31
BI: 30
Database Administration: 29
Last 365 Days:
Enterprise Database Administration Deployment: 597
Application Database Development: 555
BI Platform Architecture, Development Administration: 351
Cloud Application Development Deployment: 253
Analytics and Visualization: 232
Professional Development: 165
Strategy and Architecture: 132
Advanced Analysis Techniques: 120
Other: 81
BI Information Delivery: 79
Database Administration: 65
Business Intelligence Development: 38
BI: 26
Database Administration and Development

11 SSIS for Word Counts

12 Text Mining: Word Counts with SSIS
extraction-component/
With SSIS experience: < 30 minutes
Without SSIS experience: < 2 hours

13 SSIS Text Extraction for Word Count

14

15 Results
session 14,263; Server 14,148; database 6,172; performance 5,253; tool 3,578; time 3,415; BI 3,205; query 3,166; feature 3,078; power 2,977; Azure 2,965; Microsoft 2,879; table 2,762; business 2,497; DBA 2,437; environment 2,351; application 2,292; code 2,256; Services 2,230; SSIS 2,129; databases 2,064

16 These Words Don’t Really Help
session, thing, step, world, example, way, option, day, topic, end, presentation, talk, part, need, today, practice, work, use, attendees, organization, lot, company, people, advantage, basic, concept, business, tip, knowledge

17 Telling SSIS to Exclude Certain Words

18 Results
data 17,441; SQL 16,616; Server 14,148; database 6,172; performance 5,253; tool 3,578; time 3,415; query 3,166; power 2,977; Azure 2,965; Microsoft 2,879; table 2,762; solution 2,569; DBA 2,437; environment 2,351; application 2,292; code 2,256; Services 2,230; SSIS 2,129; databases 2,064; system 2,053; developer 2,052; index 1,989; report 1,984; process 1,820; model 1,807; user 1,790; problem 1,773; Package 1,732; plan 1,691; analysis 1,682; change 1,674; PowerShell 1,659; Management 1,566; development 1,555; information 1,541; issue 1,499; design 1,498; cloud 1,497; type 1,478

19 Advanced Options

20 Results data SQL Server 7,007 developer 1,228 demo 854 DBAs 761 5,346
PowerShell 1,159 user 852 change 733 time 2,189 SSIS 1,123 cloud 844 technique 711 performance 1,802 databases 1,110 environment 843 technology 628 tool 1,631 server 987 report 832 difference 589 DBA 1,549 problem 952 year 826 issue 551 database 1,409 table 921 process 813 backup 550 SQL 1,407 index 897 solution 811 method 543 query 1,397 code 884 information 798 job 519 Microsoft 1,282 application 875 system 771 package 513

21 TFIDF: Term Frequency–Inverse Document Frequency
TFIDF measures the uniqueness of a specific term across all input rows. It is calculated as (frequency of the term) * log((number of input rows) / (number of rows containing the term)). A higher TFIDF value indicates that the term appears in only a few rows and is therefore more relevant for categorization.
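A minimal R sketch of that calculation, assuming a hypothetical data frame speech_words with one row per (ImportId, word) occurrence:
library(dplyr)

n_docs <- n_distinct(speech_words$ImportId)           # "number of input rows" (one per speech)

speech_tfidf <- speech_words %>%
  count(ImportId, word, name = "term_freq") %>%       # frequency of the term in each speech
  group_by(word) %>%
  mutate(docs_with_term = n_distinct(ImportId)) %>%   # number of rows containing the term
  ungroup() %>%
  mutate(tfidf = term_freq * log(n_docs / docs_with_term))
The tidytext function bind_tf_idf() computes a closely related measure, with the term frequency normalized by document length.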

22 Results SQL Server 8716.134 Microsoft 3512.296 report 2793.588 change
data SSIS application DBAs time developer user technique performance databases demo technology tool index environment difference DBA server process backup query table information execution plan PowerShell problem solution package SQL cloud system Biml database code year issue

23 All Events Vs. Chicago Vs. Iowa City
Term Score SQL Server 3459 27 22 data 3031 19 Data 12 time 1179 PowerShell database 9 performance 961 tool 10 user 7 Power BI 851 query Microsoft 806 Data Warehouse DBA 797 Azure 747 8 Temporal table 6 725 cloud Query 671 CosmosDB SQL 660 model Application 5 developer 634 system Developer 614 Change databases 566 PBI 540 Machine 4 server 496 security Deadlock table 486 demo problem 484 index Powershell 470 information Performance year 469 memory SSIS

24 Visualizing Results
This is where we start talking about Python and R.

25 Go from this

26 Term Score SQL Server 3459 SQL 660 code 465 DBAs 370 issue 284 data 3031 developer 634 solution 451 report 365 job 274 time 1179 PowerShell 614 index 443 change 358 type 273 performance 961 databases 566 application 429 technology 348 package 268 Power BI 851 cloud 540 environment 428 Azure 344 T-SQL 266 tool 806 server 496 process 422 technique 339 statistic 260 DBA 797 table 486 information 414 Biml 331 Azure SQL Database 255 database 747 problem 484 SSIS 393 execution plan 315 procedure Microsoft 725 demo 470 user 376 backup 297 number 248 query 671 year 469 system 374 difference 296 data warehouse 244

27 To This

28

29 R and Python Packages
R packages include RODBC, tm, wordcloud, tidytext, and dplyr.
Python packages include pyodbc, numpy, pandas, matplotlib, Pillow, and wordcloud.
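A minimal setup sketch for the R side, installing the packages listed above once and then loading them each session:
install.packages(c("RODBC", "tm", "wordcloud", "tidytext", "dplyr"))
library(RODBC); library(tm); library(wordcloud); library(tidytext); library(dplyr)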

30 Programming Workflow
Obtain data (RODBC or pyodbc): 1) create a connection to your database, 2) read the data into a variable
Data cleansing: remove extra spaces, convert to lower case, remove punctuation, remove stop words
Create wordcloud

31 Obtain Data
Populate a variable with the database connection:
library(RODBC)
dbconnection <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0}; server=(name); database=(name); trusted_connection=yes;")
Read the table data into a variable:
speeches2019 <- sqlFetch(dbconnection, 'Vw_SpeechesUSA', colnames=FALSE, rows_at_time=1000)   # view or table name

32 Data Cleansing
Convert to lower case:
speeches2019_text <- tolower(speeches2019_text)   # speeches2019_text is the text column pulled from speeches2019
Remove punctuation:
speeches2019_text <- gsub("[[:punct:]]", "", speeches2019_text)
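The workflow slide also lists removing extra spaces; a one-line sketch of that step on the same variable:
speeches2019_text <- gsub("\\s+", " ", speeches2019_text)   # collapse runs of whitespace into a single space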

33 Remove Stop Words
# create corpus
library(tm)
speeches2019.text.corpus <- Corpus(VectorSource(speeches2019_text))
# clean up by removing stop words
speeches2019.text.corpus <- tm_map(speeches2019.text.corpus, removeWords, exclusions_text)
speeches2019.text.corpus <- tm_map(speeches2019.text.corpus, removeWords, stopwords_text)
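Neither exclusions_text nor stopwords_text is defined on the slide; a plausible sketch uses tm's built-in English stop-word list plus a custom vector along the lines of slide 16:
stopwords_text  <- stopwords("english")                   # built-in English stop-word list from tm
exclusions_text <- c("session", "thing", "step", "world", # hypothetical subset of the slide-16 word list
                     "example", "way", "option", "day")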

34 Create WordCloud
library(wordcloud)
speechmatrix <- TermDocumentMatrix(speeches2019.text.corpus, control = list(wordLengths = c(1, Inf)))
speechmatrix <- sort(rowSums(as.matrix(speechmatrix)), decreasing = T)   # named vector of word frequencies
wordcloud(words = names(speechmatrix), freq = speechmatrix, min.freq = 4,
          colors = brewer.pal(8, "Dark2"), random.color = T, random.order = F, max.words = 150)

35

36 Sentiment Analysis
Sentiment analysis uses crowd-sourced dictionaries to determine the emotions within the text. There are three main dictionaries:
AFINN scores sentiment on a numeric scale from -5 (most negative) to 5 (most positive).
Bing uses a binary rating of positive or negative.
NRC categorizes words into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
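From R, the three lexicons can be inspected through the tidytext package (the AFINN and NRC lexicons are downloaded via the textdata package the first time they are requested):
library(tidytext)

get_sentiments("afinn")   # word plus a numeric value from -5 to 5
get_sentiments("bing")    # word plus a positive/negative label
get_sentiments("nrc")     # word plus one of the ten NRC categories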

37 Sentiment Analysis Code
library(dplyr); library(tidytext); library(reshape2); library(wordcloud)
# the input must be a tidy data frame with a 'word' column (one word per row),
# not the term-document matrix built on the word-cloud slide
speechmatrix %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "blue"), max.words = 100)

38

39 Topic Modeling Instead of classifying individual words, what if we did entire documents?

40 Topic Modeling
Classification of entire documents is based on the same principle as clustering numeric data: your code finds natural groups of items and organizes documents into those groups.
Latent Dirichlet allocation (LDA) is the most popular method of creating topic models. LDA treats each document as a mixture of topics and each topic as a mixture of words, which means there may be some overlap in the results rather than discrete groups.
Two relevant R packages: the topicmodels package provides the per-document-per-topic probability, called gamma; the tidytext package provides the per-topic-per-word probability, called beta.
Two relevant Python libraries: gensim and nltk.

41 Word_Counts data frame

42 Original Text for “Fishing for Answers”
Did you ever hear someone say “make your data lake your staging area for your data warehouse” or “a data lake can handle any data format” or “if I build a data lake data everyone will use it”? Really? Are these good ideas and true statements? Is there a design strategy that I should follow so I don’t end up with a “data swamp”? Are there tools or techniques to load data into the lake and get data out easily? Can I easily visualize my data in the lake? What about security? And finally, how does a data lake really fit into my BI landscape when I have so many infrastructure and tool options to pick from? This session will demo putting data into a data lake, pulling data, visualizing data, and testing data lake performance. I will evaluate the strengths, weaknesses and implementation approaches. At the end of the session everyone should have a clear picture of what Azure Data Lake is, how to implement it, and is it a good fit for your organization.

43 With Better Context Did you ever hear someone say “make your data lake your staging area for your data warehouse” or “a data lake can handle any data format” or “if I build a data lake data everyone will use it”? Really? Are these good ideas and true statements? Is there a design strategy that I should follow so I don’t end up with a “data swamp”? Are there tools or techniques to load data into the lake and get data out easily? Can I easily visualize my data in the lake? What about security? And finally, how does a data lake really fit into my BI landscape when I have so many infrastructure and tool options to pick from? This session will demo putting data into a data lake, pulling data, visualizing data, and testing data lake performance. I will evaluate the strengths, weaknesses and implementation approaches. At the end of the session everyone should have a clear picture of what Azure Data Lake is, how to implement it, and is it a good fit for your organization. Data Lake 6 Data warehouse 1 Azure Data Lake 1 Data Swamp 1 Data 8

44 Original Text for “Architecting for Active/Active Operations”
Get real-world advice on how to support active/active ops. You'll learn: * pros and cons of building active/active into the app code * how cloud infrastructure helps the cost model for active/active ops but can complicate your design * reference architectures for successful customer implementations * case studies of companies running active/active data centers Attendees attending the session will: •Understand the application impact of building for active/active ops •Learn best practices for scaled out, distributed SQL Server clusters to limit the impact of database downtime on application availability •Get best practices from customers on enabling active/active operations across data centers

45 With Better Context Get real-world advice on how to support active/active ops. You'll learn: * pros and cons of building active/active into the app code * how cloud infrastructure helps the cost model for active/active ops but can complicate your design * reference architectures for successful customer implementations * case studies of companies running active/active data centers Attendees attending the session will: •Understand the application impact of building for active/active ops •Learn best practices for scaled out, distributed SQL Server clusters to limit the impact of database downtime on application availability •Get best practices from customers on enabling active/active operations across data centers Instead of 12 counts of the word “active”, it would be more meaningful to have 6 counts of “active/active’

46 Word_Counts data frame

47 But Let’s Go Down The Current Path
library(topicmodels)
# chapters_dtm: a document-term matrix built from the speech word counts
chapters_lda <- LDA(chapters_dtm, k = 4)
speechterms <- terms(chapters_lda, 10)   # top 10 terms for each of the 4 topics
speechterms
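Building on chapters_lda above, the two probability matrices mentioned on the topic-modeling slide can be pulled out with tidytext's tidy() method; a short sketch:
library(tidytext)

speech_beta  <- tidy(chapters_lda, matrix = "beta")    # per-topic-per-word probability
speech_gamma <- tidy(chapters_lda, matrix = "gamma")   # per-document-per-topic probability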

48 Is pairwise_count() Our Solution?
pairwise_count(), from the widyr package, counts the number of times pairs of words occur together within the same speech (ImportId).
library(widyr); library(dplyr)
speech_word_pairs <- word_counts %>%
  pairwise_count(word, ImportId, sort = TRUE, upper = FALSE)
speech_word_pairs
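As a usage sketch (widyr returns columns item1, item2, and n), the pairs that involve one particular word can be filtered out to see what it co-occurs with:
speech_word_pairs %>%
  filter(item1 == "data" | item2 == "data") %>%   # pairs that include the word "data"
  head(10)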

49 Results

50 Term Score SQL Server 3459 SQL 660 code 465 DBAs 370 issue 284 data 3031 developer 634 solution 451 report 365 job 274 time 1179 PowerShell 614 index 443 change 358 type 273 performance 961 databases 566 application 429 technology 348 package 268 Power BI 851 cloud 540 environment 428 Azure 344 T-SQL 266 tool 806 server 496 process 422 technique 339 statistic 260 DBA 797 table 486 information 414 Biml 331 Azure SQL Database 255 database 747 problem 484 SSIS 393 execution plan 315 procedure Microsoft 725 demo 470 user 376 backup 297 number 248 query 671 year 469 system 374 difference 296 data warehouse 244

51 Noun Phrases

52 With Better Context Did you ever hear someone say “make your data lake your staging area for your data warehouse” or “a data lake can handle any data format” or “if I build a data lake data everyone will use it”? Really? Are these good ideas and true statements? Is there a design strategy that I should follow so I don’t end up with a “data swamp”? Are there tools or techniques to load data into the lake and get data out easily? Can I easily visualize my data in the lake? What about security? And finally, how does a data lake really fit into my BI landscape when I have so many infrastructure and tool options to pick from? This session will demo putting data into a data lake, pulling data, visualizing data, and testing data lake performance. I will evaluate the strengths, weaknesses and implementation approaches. At the end of the session everyone should have a clear picture of what Azure Data Lake is, how to implement it, and is it a good fit for your organization. Data Lake 6 Data warehouse 1 Azure Data Lake 1 Data Swamp 1 Data 8

53 Resulting Phrases

54 Original Text for “Architecting for Active/Active Operations”
Get real-world advice on how to support active/active ops. You'll learn: * pros and cons of building active/active into the app code * how cloud infrastructure helps the cost model for active/active ops but can complicate your design * reference architectures for successful customer implementations * case studies of companies running active/active data centers Attendees attending the session will: •Understand the application impact of building for active/active ops •Learn best practices for scaled out, distributed SQL Server clusters to limit the impact of database downtime on application availability •Get best practices from customers on enabling active/active operations across data centers

55 Resulting Phrases

56 SQL Server 7007 performance issue 253 Query Store 194 performance tuning 157 execution plan 503 SQL Server instance 237 query performance 192 SQL Server Reporting Services 156 data warehouse 478 Visual Studio 230 query optimizer 188 performance problem 154 SSIS package 413 Power View 218 SQL Server Integration Services 184 data quality 150 big data 373 Reporting Services 214 transaction log 182 SQL Server Analysis Services 144 Extended Events 291 SQL Azure Availability Groups 181 SQL Server environment 143 Azure SQL database 280 tabular model 212 Microsoft SQL Server 174 database professional 141 Analysis Services 269 high availability 207 data professional 170 deep diva 131 different type 266 Power Query 206 columnstore index 164 Service Broker 129 SQL Servers 261 data source 198 Disaster Recovery 158 Extended Event

57 Word Cloud Code
speechphrases <- sqlFetch(dbconnection, 'Vw_ViewforFinalWordcloud', colnames=FALSE, rows_at_time=1000)
wordcloud(words = speechphrases$Term, freq = speechphrases$Score, min.freq = 4,
          colors = brewer.pal(8, "Dark2"), random.color = T, random.order = F, max.words = 200)

58

59

60 Pre 2016 Post 1/1/2016 Term Score SQL Server 3893 3247 data warehouse
285 execution plan 283 SSIS package 246 Azure SQL Database 263 Big Data 237 211 230 Query Store 185 Power View 205 174 SQL Azure 204 big data 147 different type 190 query performance 124 Analysis Services 188 Availability Groups 123 SQL Servers 186 Extended Events 117 Reporting Services 183 SQL Server instance 112 179 performance issue 108 Visual Studio 154 data professional 106 tabular model 150 temporal table 105 Power Query 104 query optimizer 133 data analysis 100 130 Azure Data Factory 99 SQL Server Reporting Services 127 data science 98 SQL Server Integration Services High Availability 93 columnstore index 120 data scientist 92 data source 119 high availability 116 data lake 91 data quality 114 Azure ML 81 transaction log disaster recovery 80 Pre 2016 Post 1/1/2016

61 Remember this? Past 365 Days First 365 Days
Enterprise Database Administration Deployment 331 333 Application Database Development 260 199 BI Platform Architecture, Development Administration 180 Other 116 Professional Development 130 113 Cloud Application Development Deployment 102 110 Analytics and Visualization 68 DBA 79 Strategy and Architecture 56 BI Information Delivery 70 43 Business Intelligence 51 Database Administration 42 29 Track 3 35 Advanced Analysis Techniques 28 34 19 33 Database Development 18 31 Design BI 30 Enterprise Database Administration, Deployment Monitoring Development DBA - Administration PowerShell 36 Track 1 Enterprise BI 25 BI 1 24

62 Application Development BI Platform Professional Development Cloud
Enterprise DBA Application Development BI Platform Professional Development Cloud Analytics SQL Server 1024 555 Power BI 207 38 Azure SQL Database 88 140 new feature 101 execution plan 125 185 positive solution 19 78 95 93 query performance 55 data warehouse 86 soft skill 17 Azure Data Factory 28 data analysis 36 Query Store 92 37 SSIS package 84 personal brand Azure SQL DB 25 R language 26 Extended Events Service Broker Analysis Services 53 career path 15 Forensic Analytics in-memory table 33 Business Intelligence Markup Language 34 Lone DBA 13 Azure Analysis Services 16 Azure ML 21 Availability Groups 79 Visual Studio 32 database person SQL server SQL Server R Services regular expression 67 Dynamic SQL dimensional modeling 27 social media 12 Azure Data Lake 14 data science performance issue 59 window function SQL Server Integration Services 24 family Issues 11 CSV file Power Query SQL Server instance unit testing new technology healthy balance Microsoft Azure data scientist SQL Servers 47 database development tabular model common marriage Azure Functions real-world situation disaster recovery T-SQL code data source technical blog 10 Azure SQL Data Warehouse data analytical method transaction log graphical execution plan data lake 20 gender bias powerful open source package Availability Group bad data 23 Gender Bias Cosmos DB Machine Learning database developer 22 maintenance time first-hand experience Amazon Web Services R package ecosystem High availability Unit test Data Warehouse time thinking Azure portal core construct Temporal table Query plan Big data dream job Cloud application premiere language Database administrator 30 Entity Framework Different way data professional Right choice 9 flat file In-Memory OLTP 29 Data type SSIS development open discussion Cloud platform Machine Learning Services

63 Enterprise DB Administration
Availability Groups 133 3 Extended Events 128 execution plan 2 new feature 101 disaster recovery 93 memory grant Query Store 92 resource limit Azure SQL Database 84 SQL Server instance regular expression 67 Managed Instance performance issue 59 Resource Governor 55 38 transaction log 34 high availability 32 temporal table database administrator 30 All vs Chicago

64 Application Database Development
execution plan 125 database 5 query performance 55 index new feature 37 deadlock 4 Service Broker 36 goal in-memory table 33 performance Visual Studio SELECT statement 3 Dynamic SQL 32 system window function 27 table unit testing 26 information database development function T-SQL code 25 query graphical execution plan/query plan 45 Encrypted bad data 23 column database developer 22 procedure unit test Entity Framework 21 Application Database Development All vs. Chicago

65 BI Platform Power BI 207 SSIS package, Framework, development 4
data warehouse 86 Biml 3 SSIS package 84 ETL Business Intelligence Markup Language 69 Data Vault modeling/data modeling/modeling Analysis Services 53 2 new feature 32 dimensional modeling 27 SQL Server Integration Services 24 new technology tabular model 21 data source data lake 20 Azure Data Factory maintenance time All vs. Chicago

66 Cloud Azure SQL Database 113
data science, scientist, problem, solution 5 Azure Data Factory 28 Azure Analysis Services 4 Power BI 19 Azure Databricks 3 16 application developer 2 Azure Data Lake 14 automated testing CSV file 13 Azure Container Registry Microsoft Azure Azure Functions 12 Azure DevOps Azure SQL Data Warehouse Visual Studio Azure Kubernetes Services Cosmos DB 11 Amazon Web Services cloud computing, platform container-based application All vs. Chicago

67 Analytics Power BI 140 data analysis, analytical method 2
Azure ML, machine learning 37 Python language data analysis 36 real-time analytics data science, scientist 32 text mining technique R language 26 Forensic Analytics 25 SQL Server R Services 19 Power Query 15 data analytical method 13 powerful open source package real-world situation core construct 12 flat file premiere language R package ecosystem All vs. Chicago

68 Revisiting Sentiment Analysis
Bing, Loughran, and AFINN are three different lexicons, or sentiment maps, and each rates sentiment on a different scale:
Bing is positive/negative.
Loughran is positive, negative, litigious, uncertainty, and constraining.
AFINN rates sentiment on a scale from -5 to 5.
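As a rough sketch of applying AFINN at the speech level (assuming the word_counts data frame has word, ImportId, and n columns, which is an assumption about its layout):
library(dplyr); library(tidytext)

speech_afinn <- word_counts %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%  # adds the numeric 'value' column
  group_by(ImportId) %>%
  summarise(sentiment = sum(value * n))                  # net AFINN score per speech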

69 Bing Sentiment Analysis

70

71 AFINN Sentiment Analysis

72

73 Loughran Sentiment

74

75 Questions

