GROUP BY & Subset Data Analysis

1 GROUP BY & Subset Data Analysis
Farrokh Alemi, Ph.D. This section provides a brief introduction to the GROUP BY command within SQL and shows how it can be used to create summaries of data. This brief presentation was organized by Dr. Alemi. It was narrated by xxx

2 Purpose
The GROUP BY command tells the software to summarize the values in a column for subsets of data. If several values are reported within the subset, the GROUP BY command reports only one value per subset.
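As a minimal sketch of this idea, the snippet below runs a GROUP BY query against a small in-memory SQLite database through Python's sqlite3 module. The table name visits, its columns, and the values are made up for illustration and do not come from the presentation. Each patient has several records, but the query returns one row per patient.

```python
import sqlite3

# Hypothetical data: several cost records per patient.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, 100.0), (1, 300.0), (2, 50.0), (2, 70.0), (2, 90.0)],
)

# GROUP BY collapses the records of each patient into a single row.
rows = conn.execute(
    "SELECT patient_id, AVG(cost) AS avg_cost "
    "FROM visits "
    "GROUP BY patient_id "
    "ORDER BY patient_id"
).fetchall()

print(rows)  # [(1, 200.0), (2, 70.0)]
```

Five input records become two output rows, one for each subset defined by patient_id.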

3 Syntax
The syntax of the GROUP BY command is given in this slide.

SELECT expression1, expression2, ... expression_n,
       aggregate_function (other_expressions)
FROM tables
[WHERE conditions]
GROUP BY expression1, expression2, ... expression_n
[ORDER BY aggregate_function (expression) [ ASC | DESC ]];

4 Syntax
Any field, or expression of fields, in the SELECT portion of the command must either be listed in the GROUP BY command or be enclosed within an aggregate function.

5 Aggregate Functions: AVG
Aggregate functions include AVG, which averages the values of all records in the subset of data.

6 Aggregate Functions: STDEV
Aggregate functions also include STDEV, which calculates the standard deviation of the values across all records in the subset of data.
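The following sketch shows AVG and STDEV computed per subset. It runs on SQLite through Python's sqlite3 module, which is an assumption made for the sake of a runnable example. SQLite has no built-in STDEV, so the code registers a custom aggregate under that name. The sketch uses the sample standard deviation (n - 1 in the denominator), which is what STDEV returns in SQL Server. The table labs and its values are hypothetical.

```python
import math
import sqlite3


class SampleStdev:
    """Custom SQLite aggregate: sample standard deviation (n - 1 denominator)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per record in the subset; NULLs are ignored.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per subset to produce the single reported value.
        n = len(self.values)
        if n < 2:
            return None
        mean = sum(self.values) / n
        return math.sqrt(sum((v - mean) ** 2 for v in self.values) / (n - 1))


conn = sqlite3.connect(":memory:")
conn.create_aggregate("STDEV", 1, SampleStdev)

# Hypothetical lab results, several per patient.
conn.execute("CREATE TABLE labs (patient_id INTEGER, result REAL)")
conn.executemany(
    "INSERT INTO labs VALUES (?, ?)",
    [(1, 2.0), (1, 4.0), (1, 6.0), (2, 10.0), (2, 10.0)],
)

stats = conn.execute(
    "SELECT patient_id, AVG(result), STDEV(result) "
    "FROM labs GROUP BY patient_id ORDER BY patient_id"
).fetchall()

for patient_id, mean, sd in stats:
    print(patient_id, mean, sd)
```

The step and finalize methods mirror what any aggregate function does: it sees every record in the subset and then reports one value for it.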

7 Aggregate Functions: COUNT
A common aggregate function is COUNT, which counts the non-null values in the subset. COUNTIF, which counts a value only if it meets a logical test, is not part of standard SQL, although some dialects provide it. Where it is missing, the same result is obtained with COUNT(CASE WHEN condition THEN 1 END). COUNT(DISTINCT Field), written without a comma, counts the distinct values in the field.
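The variants of COUNT can be compared side by side in a short sketch. It uses Python's sqlite3 module with a made-up table dx. The column names and values are assumptions for illustration only. The conditional count is written with a CASE expression, which works in dialects that have no COUNTIF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dx (patient_id INTEGER, icd9 TEXT, visit_type TEXT)")
conn.executemany(
    "INSERT INTO dx VALUES (?, ?, ?)",
    [
        (1, "250.00", "inpatient"),
        (1, "250.00", "outpatient"),
        (1, "401.9", "outpatient"),
        (2, "428.0", "inpatient"),
        (2, None, "outpatient"),  # record with a missing diagnosis code
    ],
)

counts = conn.execute(
    "SELECT patient_id, "
    "       COUNT(*) AS n_records, "             # all records in the subset
    "       COUNT(icd9) AS n_codes, "            # non-null codes only
    "       COUNT(DISTINCT icd9) AS n_distinct, "  # distinct non-null codes
    "       COUNT(CASE WHEN visit_type = 'inpatient' THEN 1 END) AS n_inpatient "
    "FROM dx GROUP BY patient_id ORDER BY patient_id"
).fetchall()

print(counts)  # [(1, 3, 3, 2, 1), (2, 2, 1, 1, 1)]
```

Patient 2 has two records but only one counted code, because COUNT applied to a field skips null values.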

8 Aggregate Functions: MAX or MIN
Finally, the MAX and MIN functions select the maximum or minimum value for the subset of data. The maximum of a numerical field gives the largest number in the subset. The maximum of a date field selects the most recent date. The minimum of a date field selects the earliest date in the subset.
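A brief sketch of MIN and MAX on dates follows, again using Python's sqlite3 module. SQLite has no dedicated date type, so the sketch assumes dates are stored as ISO 8601 text (YYYY-MM-DD), which sorts in chronological order. The table visits and its dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, visit_date TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "2019-03-15"),
        (1, "2021-07-01"),
        (1, "2020-01-20"),
        (2, "2018-11-05"),
    ],
)

# MIN gives the first visit and MAX the most recent visit of each patient.
first_last = conn.execute(
    "SELECT patient_id, MIN(visit_date), MAX(visit_date) "
    "FROM visits GROUP BY patient_id ORDER BY patient_id"
).fetchall()

print(first_last)
# [(1, '2019-03-15', '2021-07-01'), (2, '2018-11-05', '2018-11-05')]
```

A patient with a single record gets the same date for both the minimum and the maximum.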

9 Optional Portion
The WHERE and ORDER BY commands are optional.

10 Optional Portion
The ORDER BY command sorts the reported rows in ascending or descending order of one or more fields or aggregate values.

11 Optional Portion
The WHERE command restricts the data to the records for which the stated condition is met.
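The two optional portions can be seen together in a small sketch. It uses Python's sqlite3 module and a hypothetical table claims. The names and numbers are assumptions for illustration. WHERE keeps only the records from 2021, and ORDER BY sorts the groups by their average cost in descending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (patient_id INTEGER, cost REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (1, 100.0, 2020),
        (1, 200.0, 2021),
        (2, 500.0, 2021),
        (2, 900.0, 2019),
        (3, 50.0, 2021),
        (3, 60.0, 2021),
    ],
)

ranked = conn.execute(
    "SELECT patient_id, AVG(cost) AS avg_cost "
    "FROM claims "
    "WHERE year = 2021 "          # restricts the records before grouping
    "GROUP BY patient_id "
    "ORDER BY AVG(cost) DESC"     # sorts the groups by the aggregate value
).fetchall()

print(ranked)  # [(2, 500.0), (1, 200.0), (3, 55.0)]
```

Dropping the WHERE line would bring the 2019 and 2020 records back into the averages, and dropping the ORDER BY line would leave the row order unspecified.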

12 Example
The code snippet shows an example of the use of the GROUP BY command. Among patients who have not died, the code reports the number of distinct diagnoses for the 10 patients with the most distinct diagnoses.

USE AgeDx
SELECT top 10 ID, Count(distinct icd9) AS CountDx
FROM dbo.final
WHERE AgeAtDeath is null
GROUP BY ID
ORDER BY Count(distinct icd9) desc;

13 Example
In the USE and FROM parts, the code specifies that the table “final” from the database AgeDx should be used.

14 Example
The GROUP BY command tells the computer to do a separate analysis for each patient. Since several records are available for each patient ID, the GROUP BY command tells the computer to return only one row per patient.

15 Example
In the SELECT portion of the code, ID is listed without an aggregate function because it is already part of the GROUP BY command. The ICD9 code is not in the GROUP BY command, so it must be listed with an aggregate function, in this case the COUNT function. The same holds true for fields listed in the ORDER BY command.

16 Example
The COUNT command, used with DISTINCT, tells the computer to report the number of distinct entries in the field ICD9.

17 Example
The WHERE command tells the computer to focus on patients who are alive, meaning those with no recorded age at death. Note that fields in the WHERE portion of the code do not need to be enclosed in an aggregate function. The WHERE command is executed before the GROUP BY command. In large data sets, the WHERE command can make GROUP BY computations much faster because fewer records need to be grouped.
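To try the example query without access to the AgeDx database, the sketch below rebuilds a tiny stand-in for the table “final” in an in-memory SQLite database through Python's sqlite3 module. The rows are invented for illustration. Two adaptations are needed because SQLite is not SQL Server: there is no USE statement or dbo schema, and LIMIT 10 takes the place of TOP 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE final (ID INTEGER, icd9 TEXT, AgeAtDeath REAL)")
conn.executemany(
    "INSERT INTO final VALUES (?, ?, ?)",
    [
        (1, "250.00", None),
        (1, "401.9", None),
        (1, "250.00", None),  # repeated diagnosis, counted once by DISTINCT
        (2, "428.0", None),
        (3, "410.9", 72),     # patient 3 has died and is removed by WHERE
        (3, "428.0", 72),
        (3, "496", 72),
    ],
)

result = conn.execute(
    "SELECT ID, COUNT(DISTINCT icd9) AS CountDx "
    "FROM final "
    "WHERE AgeAtDeath IS NULL "            # applied before grouping
    "GROUP BY ID "
    "ORDER BY COUNT(DISTINCT icd9) DESC "
    "LIMIT 10"                             # SQLite equivalent of TOP 10
).fetchall()

print(result)  # [(1, 2), (2, 1)]
```

Patient 3 has the most distinct diagnoses in the raw table but never reaches the GROUP BY step, because the WHERE condition removes those records first.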

18 Resulting Data
The slide reports the resulting data in two columns, ID and CountDx. Each ID is followed by the count of the patient’s distinct diagnoses. For ID 134,748 there were 195 distinct diagnoses. This seems like a lot, but we need to see over what timeframe the diagnoses were recorded.

19 Summarize One, Summarize All (WHERE Is an Exception)
If you summarize one field in your query, every field listed in the SELECT portion must either be summarized or appear in the GROUP BY command. Fields used only in the WHERE portion are the exception. So if you do not need to summarize a field, just do not include it in the SELECT portion of the query.

20 The GROUP BY command summarizes the fields for subsets of data.

