Presentation on theme: "Aggregate Data and Statistics"— Presentation transcript:
1 Aggregate Data and Statistics. Wendy Watkins, Carleton University; Chuck Humphrey, University of Alberta. Title page. Greetings: Good morning everyone, glad to be with you for this workshop. If you have questions during our presentation, please just ask and we can elaborate. Statistics Canada Data Liberation Initiative, CAPDU/DLI Training, May 29th, 2002
2 Outline: What are aggregate data? Why aggregate? How to aggregate? Computing exercise
3 What are aggregate data? Let’s start with the relationship between statistics and data.
4 Statistics and Data. Data: numeric files created and organized for analysis; requires processing; not ready for display. Statistics: numeric facts/figures; created from data, i.e., already processed; presentation-ready.
8 Statistics and Data: In short, statistics are created from data and represent summaries of the detail observed in the data.
9 What is aggregation? Building on the previous example, let’s explore aggregation. We see a table with the number of smokers summarized over categories of age, education, sex, and geography at different time points.
10 [Table callouts: Categories of Periods; A Statistic; Categories of Sex; Categories of Region]
11 What is aggregation? Aggregation involves tabulating a summary statistic across all of the categories or levels of a set of variables.
12 The summary statistic: The summary statistic in this example is the total number of smokers.
13 Variables and categories: The variables and their categories are: Region (11): Canada and the ten provinces; Age (5): Total, 15-19, 20-44, 45-64, 65+; Sex (3): Total, Female, Male; Education (4): Total, Some secondary or less, Secondary graduate or more, Not stated; Periods (5): 1985, 1989, 1991, ,
14 Variables and categories: The tabulation consists of determining all combinations of categories across the variables and then counting the number of smokers within each of these combinations. 11 x 5 x 3 x 4 x 5 = 3300 category combinations.
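The count of combinations can be checked with a short Python sketch. The category lists follow the slide, with two assumptions made for illustration only: the province labels are generic stand-ins, and the two period labels that do not survive in the transcript are replaced by placeholder strings.

```python
from itertools import product

# Category lists taken from the slide; "Total" counts as a category.
regions = ["Canada"] + [f"Province {i}" for i in range(1, 11)]  # hypothetical labels
ages = ["Total", "15-19", "20-44", "45-64", "65+"]
sexes = ["Total", "Female", "Male"]
education = ["Total", "Some secondary or less",
             "Secondary graduate or more", "Not stated"]
# Only three period labels survive in the transcript; the last two are placeholders.
periods = ["1985", "1989", "1991", "period 4", "period 5"]

# Every combination of one category from each variable is one cell of the tabulation.
combinations = list(product(regions, ages, sexes, education, periods))
print(len(combinations))  # 11 * 5 * 3 * 4 * 5 = 3300
```

Each tuple in `combinations` identifies one cell in which the number of smokers would be counted.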
15 Tabulating or aggregating: One might wonder whether there is a difference between tabulating and aggregating. Usually, they are the same thing.
16 Tabulating = aggregating: In creating tables from data, the variables are arranged in various combinations along the columns and the rows.
17 Tabulating = aggregating: Placing multiple variables along the columns or rows is called nesting. Tables may have variables nested on both the columns and rows.
19 [Table callouts: Categories of Education nested within Sex; Categories of Sex nested within Region]
20 A quick summary: Up to this point, we have noted that statistics are created from data; aggregations consist of tabulating statistics within the categories of select variables; and variables may be nested within columns and rows to display these tabulations.
21 What are aggregate data? What is the difference between a tabulation or aggregation and aggregate data? The difference lies in the display of the aggregation, that is, the structure of the tabulated output.
22 What are aggregate data? A statistical data structure is a fixed, two-dimensional matrix with the variables in the columns and cases in the rows. [Diagram: columns V1 to V7, rows Case 1 to Case 7]
23 What are aggregate data? Aggregate data require the same type of statistical data structure. Consequently, aggregate data are a special type of tabulation where variables are nested along the rows but not along the columns.
25 Aggregate Data Structure: To create an aggregate data structure for the example tabulation, the combination of categories representing geography (region), three social variables (age, sex, and education), and time (period) must all be nested along the rows, as shown in the previous slide.
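A minimal Python sketch of such a structure, assuming a handful of invented respondent records and only three of the five grouping variables to keep it short. The record layout and the column name `smokers` are assumptions for illustration, not taken from the slides.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent who smokes.
smokers = [
    {"region": "Alberta", "sex": "Female", "period": 1985},
    {"region": "Alberta", "sex": "Female", "period": 1985},
    {"region": "Alberta", "sex": "Male", "period": 1985},
    {"region": "Ontario", "sex": "Male", "period": 1989},
]

group_vars = ("region", "sex", "period")

# Count respondents within each combination of categories.
counts = Counter(tuple(rec[v] for v in group_vars) for rec in smokers)

# Aggregate data structure: each grouping variable is a column, each
# combination of categories is a row (a case), and the statistic is one more column.
aggregate_rows = [
    dict(zip(group_vars, key), smokers=n) for key, n in sorted(counts.items())
]

for row in aggregate_rows:
    print(row)
```

Nothing is nested along the columns here: every variable, including the statistic, occupies exactly one column, which is what lets statistical software read the result as an ordinary data file.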
26 Another example: This time the table consists of the average length of stay in hospital by sex, age, diagnostic chapter, region, and time period.
31 Aggregate average length of hospital stays in days: The aggregate structure is represented by the 124,488 cells created by the combination of all categories from these five variables. The statistic is the average length of stay in the hospital in days.
32 What are aggregate data? Definition: Statistical summaries over categorical variables representing social phenomena, geography, and time that are organized in a specific data structure.
33 Time series aggregate data: When the data structure of the summaries is organized around time, these aggregate statistics are called a time series.
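As a rough illustration of organizing summaries around time, the Python sketch below regroups a few aggregate rows so that each region yields one series ordered by period. The numbers are invented for the example and are not Statistics Canada figures.

```python
from collections import defaultdict

# Hypothetical aggregate rows: (region, period, number of smokers).
aggregate_rows = [
    ("Alberta", 1989, 410),
    ("Alberta", 1985, 450),
    ("Ontario", 1985, 1900),
    ("Ontario", 1989, 1750),
]

# Collect the statistic by region, then order each group by period,
# so that every region yields one series of values through time.
series = defaultdict(list)
for region, period, value in aggregate_rows:
    series[region].append((period, value))

time_series = {region: sorted(points) for region, points in series.items()}

for region, points in time_series.items():
    print(region, points)
```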
39 Why aggregate? Statistics Canada creates aggregate statistics from its major surveys, including the Census, as a way of publishing selected findings. The release of aggregate statistics is a partial safeguard against the possible disclosure of respondents.
40 Why aggregate? Furthermore, the geographic distribution of statistics in Canada is important. As a result, aggregate statistics are released by Statistics Canada for different levels of geography, from the nation to small areas.
41 Why aggregate? Organizing statistics into time series is another way in which Statistics Canada publishes a large amount of statistical information. These time series reflect summaries of data that are repeatedly collected over time and permit studies of trends and change.
42 Why aggregate? To publish findings; to safeguard against disclosure; to provide geographic distributions of statistics; to present statistics over time.
43 Why aggregate? Other reasons to aggregate: to modify geo-referenced statistics for GIS applications, for example, finding postal codes within their corresponding EA (enumeration area) and then aggregating data from the postal code level up to the EA level.
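The postal code example might look like the following Python sketch. The lookup table, the postal codes, the EA identifiers, and the household counts are all made up for illustration, and the sketch assumes each postal code falls within exactly one EA, which may not hold in real geography files.

```python
from collections import defaultdict

# Hypothetical lookup table: postal code -> enumeration area (EA) identifier.
postal_to_ea = {
    "A1A 1A1": "EA-001",
    "A1A 1A2": "EA-001",
    "B2B 2B2": "EA-002",
}

# Hypothetical statistic recorded at the postal code level (a household count).
households_by_postal = {"A1A 1A1": 12, "A1A 1A2": 30, "B2B 2B2": 25}

# Sum the postal-code values within each EA.
households_by_ea = defaultdict(int)
for postal_code, n in households_by_postal.items():
    ea = postal_to_ea[postal_code]
    households_by_ea[ea] += n

print(dict(households_by_ea))
```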
44 Why aggregate? Other reasons to aggregate: to change the unit of analysis, either for the purposes of a specific research question or to create a common, higher-level unit of analysis that can be used in merging files.
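One way to picture a change in the unit of analysis is the Python sketch below, which aggregates person-level records up to households and then merges the result with a household-level file. Both files, their field names, and their values are hypothetical.

```python
from collections import defaultdict

# Hypothetical person-level file: the unit of analysis is the individual.
persons = [
    {"household_id": 1, "income": 30000},
    {"household_id": 1, "income": 20000},
    {"household_id": 2, "income": 45000},
]

# Hypothetical household-level file: the unit of analysis is the household.
households = {1: {"region": "Alberta"}, 2: {"region": "Ontario"}}

# Aggregate persons up to the household level.
summary = defaultdict(lambda: {"members": 0, "total_income": 0})
for p in persons:
    s = summary[p["household_id"]]
    s["members"] += 1
    s["total_income"] += p["income"]

# Merge the two files on the common, higher-level unit (household_id).
merged = {hid: {**households[hid], **summary[hid]} for hid in households}
print(merged)
```

After the aggregation, both files describe households, so they can be joined on `household_id` one row to one row.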
45 How does one aggregate? Identify the grouping structure that represents all of the variables and their categories over which the aggregation is to be conducted. This grouping structure defines a new unit of analysis.
46 How does one aggregate? Establish the sort order for the grouping variables, i.e., decide which variable increments the fastest, the next fastest, until you reach the variable that changes the slowest. Select the summary statistics, such as sums, averages, minimums, maximums, etc.
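The steps of grouping, sorting, and summarizing can be sketched in Python as follows. This is an illustration with invented records, not the workshop's own exercise: region and sex serve as the grouping variables, and the length of stay in days is the value being summarized.

```python
from itertools import groupby
from statistics import mean

# Hypothetical records: (region, sex, length of stay in days).
records = [
    ("Ontario", "Male", 4),
    ("Alberta", "Female", 3),
    ("Alberta", "Male", 7),
    ("Alberta", "Female", 5),
    ("Ontario", "Male", 6),
]

# Step 1: grouping structure. (region, sex) is the new unit of analysis.
def group_key(rec):
    return (rec[0], rec[1])

# Step 2: sort order. Region changes slowest; sex increments fastest.
ordered = sorted(records, key=group_key)

# Step 3: summary statistics computed within each group.
aggregated = []
for (region, sex), rows in groupby(ordered, key=group_key):
    days = [r[2] for r in rows]
    aggregated.append({
        "region": region, "sex": sex, "n": len(days),
        "sum": sum(days), "mean": mean(days),
        "min": min(days), "max": max(days),
    })

for row in aggregated:
    print(row)
```

Note that `groupby` only merges adjacent records, so the sort in step 2 is what makes the grouping in step 3 correct.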
47 How does one aggregate? The actual aggregation is performed using statistical software such as SAS or SPSS. SAS offers several procedures, as well as the Data step, that can be used to aggregate data, including Proc Summary, Proc Tabulate, and Proc Means.