1 Windows Azure SQL Database (WASD) Troubleshooting
Bob Ward, Principal Architect Escalation Engineer
I will assume basic SQL Server knowledge.
Notes (2 mins): Assume SQL knowledge. We will fly fast. Questions at the end. The last slide has a link to download the slides and scripts.
2 My Goals for You Today
- Prepare: What you learn about problems could help you prepare better to develop and deploy Azure database applications.
- Prevent: Learning about what could go wrong could help you prevent problems from happening.
- React: Learning this information can help you react quickly and efficiently should problems happen.
Notes (1 min)
3 What Will We Cover Today
- The Azure Troubleshooting Challenge
- Troubleshooting Connectivity
- WASD Errors
- Query Performance
- Practical Advice and Tips
Notes (30 secs): Less on Query Performance because troubleshooting is very similar to the “box”.
4 The Azure Troubleshooting Challenge
- WASD is a platform service (PaaS)
  - This is not a VM running SQL Server “box” (IaaS)
- Multi-tenant platform
  - You are sharing a SQL instance with other databases from other customers
- You are abstracted from the SQL Server instance, Windows, and computer server
  - Fewer admin tasks means lower TCO but also means less access
- You are isolated to a specific database
  - You have a logical server and a master but most things are done in your database
  - Most things are database scoped (Ex. DMVs)
- We make decisions to maximize all database availability
  - Application design may be required
- The service can be updated far quicker than the “box” product
Notes (3 mins)
5 WASD Connectivity Errors
Callout: Use a minimum 30-second login timeout.
WASD specific errors
- Firewall blocked in Azure
- Windows authentication not supported
- Invalid login: invalid account or password
- Denial of Service: after a large number of login failures
Network related errors
- “…Server not found”
- Connection Timeout Expired
- Msg 121 “.. The semaphore timeout period has expired”
You could lose connectivity
- Idle connections terminated after 30 minutes (Msg and 10054)
- We may forcibly disconnect on failover, on some errors, or on a change to MAXSIZE
- You need to take retries into account
Notes (3 mins): Only call out Firewall, Windows Auth, and DOS. Only call out Connection Timeout Expired. Talk about all in the last column. Login timeout of 30 secs: you need this to be able to meet our SLA requirements. I hit many of these network and connection errors while on a flight using gogo wifi. Using a 30-second login timeout helped me connect in many cases.
6 Example Connectivity Errors
Callouts on the example error messages:
- Network latency
- Be sure to give this to support
- 40XXX errors unique to WASD
- May see this after deleting a server
- After getting dropped on idle connection
Notes (3 mins): Login is a two-phase protocol for SQL Server (pre-phase and post-phase), so there are more opportunities to time out or lose a connection when going from your client to the cloud. The last error occurred after I deleted the server in the portal and tried to connect. In this situation I could telnet and ping but couldn’t connect. If you use the portal you can see that the server is not listed. Consider this website.
7 Troubleshooting Connectivity
Configuration issues
- WASD Firewall and your firewall
- Allow Windows Azure Service
Is it our service or your internet?
- Windows Azure Management Portal
- Windows Azure Service Dashboard
- Windows Azure SQL Database Connectivity Troubleshooting Guide
General tools to use
- ping.exe, telnet.exe, tracert.exe
- SQL Server 2012 Management Studio: free with SQL Server 2012 Express
- ostress.exe and sqlcmd.exe (username@<server name>)
- SQL Database Management Portal: https://<servername>.database.windows.net
New System Views (Event Tables) in the master database
- sys.event_log
- sys.database_connection_stats
Notes (5 mins):
- tracert.exe: If I don’t see an MS network show up, how do I know whether it is a problem with the MS network or my network? My suggestion is to run tracert.exe when all is well. When you see a network like this one (msn is the keyword), you have hit the MS network successfully: xe ch1-16c-1b.ntwk.msn.net [ ]. The ntwk.msn.net at the end is the key.
- The Windows Azure Portal gets servers from subscriptions, but it gets the list of databases per server by hitting our clusters.
- The SQL Database Management Portal web page hits an endpoint in our clusters and is useful to test connectivity.
- Event Tables: I usually see about a 10-15 minute delay in entries in these tables, but I’ve seen it take up to 30 minutes.
- History tables: not real time.
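The telnet.exe check listed among the general tools can also be scripted. The following is a minimal sketch, not part of the original deck: it only tests whether a TCP connection can be opened, which is the same thing a blank telnet screen tells you. The function name `can_reach` is hypothetical, and port 1433 is assumed as the standard SQL Server port.

```python
import socket


def can_reach(host: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This mirrors a telnet test: success only proves the network path and
    firewall allow the connection, not that a login would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example (hypothetical server name):
#   can_reach("<servername>.database.windows.net")
```

A False result does not say where the path is broken; tracert.exe is still the tool for that.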
8 Demo: Tools for Connectivity
Notes (5 mins):
1. Show the Azure Portal and various config features. Talk about what it means when the WA portal is successful.
2. Show the Service Dashboard and how to read history.
3. Run telnetme.cmd and show that a blank screen means it “worked”. Note you need Ctrl+] to get out of this.
4. Show the picture of tracert and point out DNS resolution for the server name and which route point shows the MS network. Now use server atrc45thlv.database.windows.net, which is in the West Europe data center in Amsterdam.
5. Connect from the SQL Management Portal and talk about what that means. Show how to take the server name and use it to connect directly to it.
6. From the context of master, show Event Table entries for connectivity using the script in connectivity\event_tables.sql. Point out that if you connect from SSMS and get a login failure, you will not get a database context in the event table. Also point out that a row can be updated after it is first seen if something changes in that interval.
9 WASD Errors
Full list here
Error categories
- Failover
- Governance and Quota
- Throttling Limits
- Engine Throttling
- “Not supported”
- Database copy
- Federation
These can result in connection termination and possible future rejection of work.
- Many “box” errors still apply (Ex. deadlock)
- Msg 40XXX range can be seen in sys.messages in SQL Server 2012
Notes (3 mins): Most of the focus is on governance and throttling. Need to make a comment that we are evolving the service, so governance, quotas, throttling, and throttling limits may change over time to make the service more reliable and predictable.
10 Failover
- We may decide to “move you” to a replica of your database on another server
  - Your database, the instance, or the computer is “unhealthy”
  - We may need to patch the instance and/or computer
- What will you see?
  - Msg 40197
  - “..Server not available”
- Implement retry logic in your application
Callouts (example message text):
- “SHUTDOWN is in progress.”
- “The partition is in transition and transactions are being terminated.”
Notes (3 mins)
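The advice to implement retry logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the deck's own code: `SqlError` is a hypothetical stand-in for whatever exception your driver raises, the set of retryable errors contains only numbers named in this deck, and the exponential backoff is one common choice rather than a documented requirement.

```python
import time


class SqlError(Exception):
    """Hypothetical stand-in for a driver exception that carries an error number."""

    def __init__(self, number, message=""):
        super().__init__(f"Msg {number}: {message}")
        self.number = number


# Assumption: only errors named in this deck are treated as transient here.
# 40197 = failover, 10054 = connection dropped.
TRANSIENT_ERRORS = {40197, 10054}


def run_with_retry(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a transient SqlError, wait and try again.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    Non-transient errors and the final failed attempt are re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlError as err:
            if err.number not in TRANSIENT_ERRORS or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In a real application, `operation` should open a fresh connection on every attempt, because a failover terminates the existing one.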
11 Governance
Callout: Resource ID : 1 = worker threads
- Max number of concurrent worker threads (currently 180) per database
- Msg if you exceed the limit
  - Connection terminated. Retry when your concurrent work subsides
  - Check for blocking problems or inefficient queries
- Msg if the overall system has too many workers
  - You may get less than the 180 max
  - Connection terminated. You can retry but it may take longer to stabilize
  - Still could be an application issue but a service issue could also be occurring
Notes (3 mins)
12 Quotas
Quota errors for space used
- Msg when you run out of space for your max size for your db
- Only reads and DELETE/DROP allowed until you free up space
- Use sys.dm_db_partition_stats to find what is consuming space
Solutions
- Increase max size
- Delete data or drop tables/indexes
- Partition out database
But…
- Freeing up space may not be immediately recognized
- Changing MAXSIZE disconnects all users
Notes (2 mins): Don’t spend much time here. Briefly mention the different message for “out of space” than on the box, the different DMV to find space usage, and that freeing up space may not happen immediately.
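To turn sys.dm_db_partition_stats output into sizes, multiply the reserved page count by the 8 KB page size. The sketch below assumes you have already fetched rows of (object name, reserved_page_count), for example by joining the DMV to sys.objects; the function name and sample shape are hypothetical.

```python
PAGE_SIZE_KB = 8  # SQL Server data pages are 8 KB


def space_by_object(rows):
    """Sum reserved pages per object and return (name, size_mb), largest first.

    `rows` is an iterable of (object_name, reserved_page_count) tuples,
    one per partition or index, as sys.dm_db_partition_stats reports them.
    """
    totals = {}
    for name, pages in rows:
        totals[name] = totals.get(name, 0) + pages
    sizes = [(name, pages * PAGE_SIZE_KB / 1024) for name, pages in totals.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

The objects at the top of the result are the first candidates for deleting data or dropping indexes.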
13 Throttling Limits
- We have a service called a “Watchdog Service” querying the instance for “conditions” to terminate connections to prevent resource problems
  - We also call these “Watchdog alerts”
  - We monitor all databases and look for conditions to prevent problems
- We will kill the session with a “reason”. The “reason” is the error message you get
- Application gets an error message (high severity) and the connection is terminated (KILL/ROLLBACK status)
- Sometimes retry works but these usually require some change on your part
- throttling_long_transaction in sys.event_log

Error   Condition
40549   Session blocking system task for long period of time (20 secs)
40550   Session is consuming too many locks (1 million)
40551   Session is consuming too much tempdb space (5 GB)
40552   Transaction consuming too much log space or active transaction preventing log truncation
40553   Session consuming memory (16 MB) and there are memory waits (20 secs)

Callouts: Rebuild index; Online
Notes (3 mins): In my testing for 40552, I was disconnected with Msg and did not receive this error.
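An application that logs errors can attach the condition from the table above to each throttling-limit error it sees. This is a small sketch with hypothetical names; the numbers and descriptions are copied from the slide, and anything outside that table returns None.

```python
# Error numbers and conditions as listed on the Throttling Limits slide.
THROTTLING_LIMITS = {
    40549: "Session blocking system task for long period of time (20 secs)",
    40550: "Session is consuming too many locks (1 million)",
    40551: "Session is consuming too much tempdb space (5 GB)",
    40552: "Transaction consuming too much log space or "
           "active transaction preventing log truncation",
    40553: "Session consuming memory (16 MB) and there are memory waits (20 secs)",
}


def explain_throttling_limit(error_number):
    """Return the condition for a throttling-limit error, or None if unknown."""
    return THROTTLING_LIMITS.get(error_number)


def is_throttling_limit(error_number):
    """True for errors that usually need an application change, not just a retry."""
    return error_number in THROTTLING_LIMITS
```

Keeping these separate from transient errors matters because, as the slide says, a retry sometimes works but these usually require a change on your part.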
14 Engine Throttling
- This is more of a legacy monitoring method used to keep instances healthy
- Another external service monitors the health of the instance and computer
  - Soft throttling: we have detected a resource issue so we pick specific databases
  - Hard throttling: entire instance at risk so all databases are affected
How it works
- Existing requests run to completion
- New requests for existing connections and new connections may get Msg and connection terminated depending on type of request
- Reason code in the error has more details on soft vs hard, what will be rejected, and why
- throttling in sys.event_log
Example reason code 0x8003:
- x03 = RejectAll
- x80 = Hard Throttling on I/O
Links: Decode reason codes; Another resource
Notes (3 mins)
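The 0x8003 example splits into a low byte (what is rejected) and a high part (throttling type and resource). The sketch below only encodes that split and the two values shown on the slide; it is an assumption that other codes follow the same layout, and the lookup tables are deliberately incomplete, so consult the "Decode reason codes" documentation for the full list.

```python
# Only the values shown on the slide are known here; everything else is reported as unknown.
MODE_NAMES = {0x03: "RejectAll"}
TYPE_NAMES = {0x80: "Hard throttling on I/O"}


def split_reason_code(code):
    """Split a reason code into (mode, throttling_type) integers.

    Assumption based on the 0x8003 example: the low byte is the mode
    and the remaining high bits identify the throttling type and resource.
    """
    return code & 0xFF, code >> 8


def decode_reason_code(code):
    """Return human-readable (mode, throttling_type) strings for a reason code."""
    mode, kind = split_reason_code(code)
    return (
        MODE_NAMES.get(mode, f"unknown mode 0x{mode:02X}"),
        TYPE_NAMES.get(kind, f"unknown throttling type 0x{kind:02X}"),
    )
```

For the slide's example, `decode_reason_code(0x8003)` gives the RejectAll mode with hard throttling on I/O.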
15 “Not Supported” Errors
- USE <db> not supported: specify the database when connecting
- ALTER DATABASE supported minimally (Ex. Name, Edition, MAXSIZE, READ_ONLY)
- No DBCC commands are supported except DBCC SHOW_STATISTICS
- Database scoped DMVs supported
Links:
- Feature Support for Windows Azure SQL Database
- Unsupported Transact-SQL Statements (Windows Azure SQL Database)
- Partially Supported Transact-SQL Statements (Windows Azure SQL Database)
Notes (1 min): Go over quickly or skip. Pay attention to this web link.
16 Demo: Using Event Tables to Troubleshoot WASD Errors
Notes (10 mins):
1. Connect with SSMS in the context of master and show sys.event_log using errors\event_log_errors.sql to show the throttling errors and deadlocks I encountered.
2. Talk about how I caused the throttling limit errors.
3. Show the deadlock error and how it is the same deadlock XML as produced by trace flag 1222.
4. Open the .xdl file in SSMS to show the deadlock graph.
17 WASD and Query Performance
Stick to the basics…
- Running or waiting? Blocking or CPU?
- Is it your application, Windows Azure role, your computer, or queries?
- Is it network latency?
- Differences from when “good”? Did the query plan change?
- Proper indexes: avoid scans, large sorts, …
- Auto create and auto update stats are on by default
There are methods to optimize performance specific to Azure
- Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
- Inevitably you may have to shard your data
- “Chatty” applications don’t usually perform well
- Avoid large result sets
- Application problems may show up earlier on this platform (Ex. transaction keeping the log from being truncated)
Notes (3 mins): Another example: trying to return a lot of rows from the service back to “on-premise” may get frequently disconnected due to network issues.
18 WASD Performance Scenarios
Interesting performance scenarios
- On-premise clients may see higher ASYNC_NETWORK_IO waits
- Small transactions may result in WRITELOG and SE_REPL* waits
- Deadlocks (Msg 1205) just like the “box”: use sys.event_log to debug
Troubleshooting query timeouts
- Could just be blocking
- Trace your queries so you know which one timed out
- Examine the query plan and tune the query/indexes
Notes (3 mins): Show my top wait stats for my database where I fed it a ton of small INSERTs and tried to SELECT back a huge row set.
- WRITELOG: small avg wait time
- SE_REPL_COMMIT_ACK: small avg wait time
- ASYNC_NETWORK_IO: larger than normal wait time
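The "avg wait time" in the notes above is total wait time divided by the number of waits. A minimal sketch, assuming you have fetched (wait_type, waiting_tasks_count, wait_time_ms) rows from sys.dm_db_wait_stats; the function name is hypothetical and the numbers in the usage comment are made up for illustration.

```python
def average_waits(rows):
    """Return (wait_type, avg_wait_ms) pairs, highest average first.

    `rows` is an iterable of (wait_type, waiting_tasks_count, wait_time_ms).
    Rows with a zero count are skipped to avoid dividing by zero.
    """
    averages = [
        (wait_type, wait_time_ms / count)
        for wait_type, count, wait_time_ms in rows
        if count > 0
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)


# Illustrative (made-up) numbers showing the pattern from the notes:
#   average_waits([("WRITELOG", 1000, 2000), ("ASYNC_NETWORK_IO", 50, 10000)])
# puts ASYNC_NETWORK_IO first because its average wait is far larger.
```

Sorting by average rather than total separates many short waits (typical of small INSERTs) from fewer long ones (a client slow to consume a large result set).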
19 Dynamic Management Views (DMV) for Performance
- sys.dm_exec_requests: Find currently running requests in your database. Use this to detect blocking.
- sys.dm_exec_query_stats: Find the performance of queries that have run in your database. Look here for the worst performing queries.
- sys.dm_exec_query_plan: Display the query plan of a specific query.
- sys.dm_db_wait_stats: Aggregated history of waits, some new for WASD. Only shows wait types with count > 0.
- “missing index DMVs”: Could indexes help query performance?
Notes (2 mins): Include pointer to this doc link.
21 Demo: Troubleshooting Query Performance on WASD
Notes (5 mins):
- Run script slowest_queries.sql to show which query is slowest and what the plan says about it.
- Notice the slowest query has 200+ seconds but low CPU time.
- If you now look at waits you see IO_COMPLETION, a sign of sort spills to disk.
- If you look at the plan you see a SORT operator here, but we used a clustered index seek.
- Note that other queries in this list actually come from using the portal and Management Studio.
- If time permits, show whoisusingio.sql to show the biggest I/O users.
- Now show the performance information from the SQL Management Portal and see this query and the plan.
22 Watch Out for These
- Keep database copies for “user error”
- Be careful dropping servers and databases in the portal
- DML may fail if there is no clustered index (temp tables excluded)
- DMVs are database scoped
- Databases have RCSI on by default: tables can be larger
- DATETIME in all data centers is stored as UTC time
- You may not have access to objects that appear in catalog views
- Non-supported or partially supported commands/features
- System Views Unique to WASD
Notes (2 mins)
23 Before you contact support
- Check the Azure forums: MSDN or stackoverflow
- Check the service dashboard
- Is it Windows Azure? On-premise problem?
- Have exact error message(s) available
- Have TracingID available
- Do you know the query?
- Do you have application retry logic?
- Give us the date and time of the issue with the “observed” timezone
- Is this happening now or in the past?
Callout: We can do RCA but… it can take some time and we may not have enough history.
Notes (3 mins): This link has some suggestions on what to retry on.
24 References
- Retry Logic for Transient Failures in Windows Azure SQL Database
- Error Messages (Windows Azure SQL Database)
- Windows Azure SQL Database Performance and Elasticity Guide
- Windows Azure SQL Database Connection Management
- sys.event_log documentation
- CSS SQL Escalation Blog
- Troubleshoot and Optimize Queries with Windows Azure SQL Database
26 The Troubleshooting Checklist
- Does the Windows Azure Portal work and list your databases?
- Is there a dashboard posting for an outage in your region?
- Does the SQL Management Portal work?
- Does SQL Server Management Studio work?
- Is there an internet provider issue?
- Is your firewall configuration correct?
- Is the problem Windows Azure vs WASD?
- Is there blocking?
- Are your queries and indexes tuned?
- Is this really an application retry issue?
- Governance, quotas, limits, and throttling are “part of this platform”
- Have you looked at Event Tables?