1
DLI Orientation: Concepts
A Framework for Thinking about Data and Statistics
Chuck Humphrey, University of Alberta
Atlantic DLI Training, 2008
2
Outline
Data and statistics: what are we talking about?
Key concepts for data and statistics
Statistics are about definitions
Framework for numeric information
DLI and standard data products
E-tables and databases
Aggregate data
Public use microdata
Spatial data
Continuum of access
Levels of service
3
What are we talking about?
4
Numeric information
Statistics: numeric facts/figures; created from data, i.e., already processed; presentation-ready
Data: numeric files; created and organized for analysis/processing; requires processing; not display-ready
I have divided numeric information into two general categories: statistics and data. This is an important distinction. Far too often, you will find people using the terms “statistics” and “data” interchangeably. But we won’t. They have very different meanings and purposes. First, statistics are the compilations of numeric facts and figures that are readily available in print and electronic format. These facts and figures are created from data and are organized for human consumption. That is, they are display-ready and appear in tables, charts, graphs or maps. Data, on the other hand, are the raw information from which statistics are derived. Data are stored in computer files and organized in structures that support their analysis or processing. Unlike statistics, which have been organized for presentation, data are not display-ready. The examples here show a table representing the number of smokers in Canada and a few of the records upon which this table was built. Let’s look at these two items in a bit more detail.
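To make the distinction concrete, here is a minimal Python sketch of data going in and statistics coming out. The records, field names and the `tabulate_smokers` function are all invented for illustration; they are not taken from any Statistics Canada file.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (all values invented).
records = [
    {"province": "Alberta", "sex": "Female", "smokes": True},
    {"province": "Alberta", "sex": "Male", "smokes": False},
    {"province": "Nova Scotia", "sex": "Female", "smokes": True},
    {"province": "Nova Scotia", "sex": "Male", "smokes": True},
    {"province": "Alberta", "sex": "Female", "smokes": True},
]


def tabulate_smokers(rows):
    """Count smokers by (province, sex): data in, statistics out."""
    return Counter((r["province"], r["sex"]) for r in rows if r["smokes"])


table = tabulate_smokers(records)

# Only now is the information display-ready.
for (province, sex), count in sorted(table.items()):
    print(f"{province:<12} {sex:<7} {count}")
```

The list of records is the data; the printed rows are the statistics. Nobody would read the records to count smokers, but the table can be read at a glance.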
5
Statistics
Geography: Region
Time: Periods
Unit of Observation Attributes: Smokers, Education, Age, Sex
Six dimensions or variables in this table
Let’s look at the table shown on the previous slide in more detail. First, it is being viewed in the Beyond 20/20 table browser, a proprietary program first introduced by Statistics Canada with the 1996 Canadian Census. [How many of you are familiar with Beyond 20/20? Let’s have a show of hands.] We’ll come back to Beyond 20/20 later in the workshop if you have questions about it. This is a very helpful tool and can be used to organize the display of statistical information. For the time being, notice the following characteristics of this table. First, the numbers in the cells of the table represent estimated counts of the number of smokers. Along the rows of the table are standard geographic areas for Canada: a national total and the ten provinces. The territories are missing from this table. Why do you think this is? The columns are organized around two groupings of years. First, there is a summary for one period; second, there is another grouping for a second period. Why do you think this is? Under each of the yearly groupings is a breakdown for males and females. At the top of the table, we see two more characteristics that have been held constant: education and age. This table could be displayed in different ways. One defining property of statistical tables is that almost all are structured around three basic concepts: geography, time and attributes of the unit of observation, i.e., the object that was observed in collecting the data. In this case, geography is represented by Region; time is represented by the two time periods; and the attributes of individuals are smoking behaviour, education level, age and sex. The cells in the table contain the estimated number of smokers.
6
Defining some key concepts
Statistics are based on a few key underlying concepts, and knowing the definitions of these concepts is useful in interpreting statistics. Referring to the previous slide, what does Statistics Canada mean by “geography”? Statistics Canada uses the concept of location to describe geography. “The concept of location is that of a physical place where the activity of a statistical unit occurs and for which data are collected.” Knowing what statistics actually represent depends on the definitions of the concepts underlying the statistics. The link in this slide takes us to the Statistics Canada website, where geography, as it is used in statistical summaries, is defined.
7
Concept: unit of observation
The concept of location refers to “statistical units.” These are the units of observation for which data are collected and that statistics describe or summarize. Statistical units for business surveys include the enterprise, the company, the establishment and the location. Statistical units for social surveys include the census family, the economic family and the household. “There are two primary sources for social statistics: one is administrative records, which generally collect information from the files of individuals; the other is from censuses and surveys where the unit of observation is the household and individuals within the household.”
8
Concept: universe
Universe describes characteristics of the unit of observation used in the selection of those from whom data are collected. This concept is closely associated with the sample design employed in selecting members of the unit of observation. The universe includes all members of the unit of observation, while the sample consists of just those members from whom data are collected. Statistics Canada uses “target population” to describe each survey’s universe.
9
Concept: sample weights
With the exception of some administrative databases, Statistics Canada employs probabilistic sampling methods to select members in the unit of observation from its universe. Typically, not every member in the unit of observation has the same probability of being selected. Consequently, Statistics Canada determines a sample weight that it includes with the data file to correct for the sample design and to provide population estimates.
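A small Python sketch shows why the weight matters. It assumes the simplest case, in which the weight is the inverse of the selection probability; the weights Statistics Canada ships with a file normally include further adjustments that are not shown here. The respondents, probabilities and function names are invented.

```python
# Hypothetical respondents with invented selection probabilities.
sample = [
    {"smokes": True, "p_select": 0.01},
    {"smokes": False, "p_select": 0.01},
    {"smokes": True, "p_select": 0.002},
    {"smokes": False, "p_select": 0.005},
]


def design_weight(p_select):
    """Basic design weight: how many people in the universe one respondent represents."""
    return 1.0 / p_select


def weighted_count(rows, predicate):
    """Population estimate: sum the weights of the rows that satisfy the predicate."""
    return sum(design_weight(r["p_select"]) for r in rows if predicate(r))


estimated_population = weighted_count(sample, lambda r: True)
estimated_smokers = weighted_count(sample, lambda r: r["smokes"])

unweighted_share = sum(r["smokes"] for r in sample) / len(sample)
weighted_share = estimated_smokers / estimated_population

print(f"unweighted share of smokers: {unweighted_share:.1%}")
print(f"weighted share of smokers:   {weighted_share:.1%}")
```

In this invented sample half of the respondents smoke, but the weighted estimate is about two thirds, because one smoker had a low probability of selection and therefore represents many people. Ignoring the weight describes the sample, not the population.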
10
Unit of observation & universe
Together, the unit of observation and the universe describe the objects from whom data are collected and to whom generalizations and descriptions are being made in statistical displays. All statistical tables are based on a specific unit of observation. Because table headings don’t explicitly say, “the unit of observation is,” one is left to infer this information from the table. With well-designed tables, the unit of observation should be obvious. Let’s look at the characteristics of a well-defined table.
11
Metadata attributes marked on the example table: Title; Unit of Observation; Universe; Variables; Statistical Metric; Footnote; Date; Producer.
Labels from the example table: Average Tuition; Discipline; Academic Year; Province; Dollars.
Let’s see how many of the metadata attributes we can find in this table.
12
Statistics are about definitions
Statistics are about definitions! And you thought that they were just about numbers. Statistics are numeric summaries that are dependent on the definitions of the attributes of the unit of observation, which we’ll discuss shortly. An essential step in creating statistics is organizing the data into meaningful categories or summary measures that help tell a story. Fundamental to this is defining the categories or measures used to create the statistics. You can see that the pointy-haired manager in Dilbert clearly understands that statistics are about definitions. He didn’t get the industry salary averages that he wanted. So, he changed the definition of his industry to include high technology, textile workers, teenagers and dead people.
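The manager’s trick is easy to reproduce. In the Python sketch below, the same records yield two different “industry averages” depending only on how the industry is defined. The sectors, salaries and the `industry_average` function are invented for illustration.

```python
from statistics import mean

# Hypothetical salary records (all values invented).
workers = [
    {"sector": "high technology", "salary": 90_000},
    {"sector": "high technology", "salary": 110_000},
    {"sector": "textiles", "salary": 30_000},
    {"sector": "teenagers", "salary": 10_000},
]


def industry_average(rows, industry_definition):
    """Average salary over the sectors that the definition counts as 'the industry'."""
    return mean(r["salary"] for r in rows if r["sector"] in industry_definition)


narrow = industry_average(workers, {"high technology"})
broad = industry_average(workers, {"high technology", "textiles", "teenagers"})

print(narrow, broad)
```

The data did not change between the two calls; only the definition did. This is why the definition has to travel with the statistic.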
13
Statistics are about definitions
Each characteristic or variable that is measured or recorded about the unit of observation must be clearly defined. Statistics Canada has definitions for some of the more frequently used concepts and variables on its website under “Definitions, data sources and methods.” The Census Dictionary is an important source for definitions of the concepts and variables in each Census.
14
Definitions use classifications
The definitions for concepts and variables use classification systems to assign categories or values to the properties of the concepts. For example, Region in this table consists of Canada and the ten provinces. The definition of concepts involves the use of classifications to assign categories to the properties of the concept.
15
Definitions use classifications
Some classifications are based on standards while others are based on convention or practice. For example: Standard Geography classifications.
National statistical agencies tend to introduce standards for their classification systems. This creates consistency in their reporting and allows for comparisons. Many of these national agencies meet and agree upon definitions and classifications so that international comparisons can occur. The link in this slide shows the standard geographic classification system used in this table.
16
Classifications involve categories
Categories shown on the slide: Sex (Total, Male, Female) and Periods.
In this table, categories are shown for the sex of individuals, the time periods used, the geography used, smoking status, how education is defined and how age is treated. Let’s look at the time periods again. Why these particular year combinations?
17
Definitions and metadata
All of the definitions and information that describe the unit of observation, the universe, the sampling method, the concepts and the variables are critical to understanding both the data and the statistics derived from the data. We used to talk about codebooks and about the User’s Guide and Data Dictionary when speaking of data documentation. Now we refer to this documentation as metadata, which has been expanded to include documentation throughout the life cycle of a survey. The Data Documentation Initiative 3.0 standard is being used to organize this information.
18
Authentic statistics are derived from data and, as such, the data sources for statistics should be reported with the statistics. The metadata that accompanied the previous table show that the statistics were derived from two waves of the National Population Health Survey. Now we can understand why the table was organized using these particular yearly groupings. They reflect the periods when the survey was conducted. In fact, it appears that more years are available in the table than were shown previously.
19
Unit of observation and data
The unit of observation also defines an important structural characteristic of data files. A record in a data file represents the information for one member of the unit of observation.
20
Data
This slide shows the microdata (i.e., raw data) for four cases and part of a fifth. The red box encapsulates one respondent from the National Population Health Survey. All of the digits in this record represent the responses to questionnaire items. Somewhere on each record, codes exist for the sex of the respondent, her or his age, education, province and whether she or he smokes. Clearly, this information requires metadata to locate these variables on each record. Furthermore, it is clear that this information is intended for computer processing and not human processing. One would not try to read this file to determine the number of female smokers, for example.
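Here is a rough Python sketch of what the metadata do for a record like the one in the red box. The column positions, the codes for sex and smoking, and the two sample lines are all invented; only the two province codes follow the Standard Geographical Classification. A real layout comes from the survey’s codebook or data dictionary.

```python
# Hypothetical record layout: variable name -> columns it occupies on each line.
LAYOUT = {
    "province": slice(0, 2),
    "sex": slice(2, 3),
    "age": slice(3, 5),
    "smokes": slice(5, 6),
}

# Hypothetical code lists: what each stored code means.
CODES = {
    "province": {"12": "Nova Scotia", "48": "Alberta"},
    "sex": {"1": "Male", "2": "Female"},
    "smokes": {"1": "Yes", "2": "No"},
}


def decode(line):
    """Turn one raw line into labelled values using the layout and code lists."""
    record = {}
    for variable, columns in LAYOUT.items():
        raw = line[columns]
        labels = CODES.get(variable)
        # Coded variables are translated; uncoded ones (age) are read as numbers.
        record[variable] = labels.get(raw, raw) if labels else int(raw)
    return record


raw_lines = ["482341", "121572"]
decoded = [decode(line) for line in raw_lines]

for record in decoded:
    print(record)
```

Without `LAYOUT` and `CODES` the string "482341" is just six digits. With them, it is a 34-year-old woman in Alberta who smokes.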
21
Stories are told through statistics
The National Population Health Survey in the previous example had over 80,000 respondents in its sample, and the Canadian Community Health Survey in 2005 had over 130,000 cases. How do we tell the stories about each of these respondents? We create summaries of these life experiences using statistics.
You’ve just seen an example of the records contained in the National Population Health Survey. In the survey, there were over 80,000 respondents. They provide a rich description of the status of population health characteristics in Canada, each one having a story to tell. The Canadian Community Health Survey in 2005 has over 130,000 stories to tell. The challenge is: How do you tell this many stories? Are they all unique or are there commonalities in the stories to be told? The answer is that we use statistics to summarize life experiences. We tell the stories in the aggregate and not at the level of each individual.
22
Summary
Statistics are derived from observational, experimental or simulated data.
A table is a format for displaying statistics and presents a summary or one view of the data.
Tables are structured around geography, time and attributes of the unit of observation.
Statistics are dependent on definitions and classification systems.
Statistics summarize individual stories into common or general stories.
Let’s summarize this discussion about the distinctions between statistics and data.
23
Framework for Numeric Information
Using the distinction between statistics and data, numeric information can be subclassified into dissemination products. The discussion about official and non-official statistics is for another workshop.
24
Where does DLI fit in this scheme?
Numeric Information; Standard Data Products
Where do DLI products fit into this scheme? Under a grouping identified in the DLI licence as “standard data products,” which includes selected databases with aggregate data, selected standard aggregate tables and selected microdata.
25
DLI and standard data products
DLI licence, article 1: “via the Data Liberation Initiative (DLI), Statistics Canada will offer my educational institution, timely access, on a subscription basis, to standard Statistics Canada data products, such as public use microdata files (non-identifiable datasets containing characteristics pertaining to surveyed units), standard files and databases (containing aggregate data as defined and determined by Statistics Canada) and geography files, in available electronic formats.”
Standard data products used to be all products for sale in the Online Catalogue. Now Statistics Canada refers to standard electronic products, which also include e-publications and e-tables, some of which are now free.
Specifically, the licence mentions public use microdata files, standard files and databases and geography files. It used to be that all products for sale in the Online Catalogue made up standard data products. Now the Online Catalogue contains e-publications and e-tables, not all of which are for sale. Click on the link to the Online Catalogue. Search for “microdata”. On April 19th, there were 147 results. In the left-hand menu, click on Price (free or $). Notice that 43 are free and 104 have a cost. Click on the “Free” link. Look at the first three entries: IALS, Adult Literacy and CCHS. What is free? Click on the Price link and select Cost. How much does the latest public use file for the Canada Survey of Giving, Volunteering and Participating cost? Is this available through the DLI?
26
Standard Electronic Products
Dissemination policy
In 2004, Statistics Canada introduced a new policy stipulating that all standard electronic products will be available either through the Depository Services Program (DSP) or DLI. This means that libraries in the academic community belonging to both the DSP and DLI should have access to all standard electronic products.
The new policy in 2004 designated one of two places for all outputs identified by author divisions for dissemination. If the product was an electronic publication (or in some instances a compendium of e-tables), it was deposited with the DSP. If the product was an aggregate database or public use microdata, it was deposited with the DLI. Go to the DLI website and search for microdata. It’s a work in progress. Another strategy for identifying products is to go to the description under each survey. From the Home page, follow: Definitions, data sources and methods; List by subject; Health; Measures of health; 3226 CCHS. Then search for “public use” and also check the “Links to related products.”
27
Standard product definitions
This next section provides definitions for e-tables, databases, aggregate data and public use microdata and presents some examples of each.
E-tables: these are tables in an electronic dissemination format (e.g., Beyond 20/20 or Excel). Tables are displays for presenting the statistical results of a data analysis and provide one view of the data expressed through the selection of variables representing geography, time and attributes of the unit of observation.
28
DLI e-table examples
The Canadian Centre for Justice Statistics Beyond 20/20 tables: only tables in Beyond 20/20 format; no public use microdata.
Survey of Household Spending: Excel tables and a public use microdata file.
29
Databases and aggregate data
Databases consist of file structures for storing aggregate data that can be viewed as either e-tables or retrieved as aggregate data. For example, CANSIM (a large database of time series) can be used for either purpose. Aggregate data consist of statistics that are organized in a data structure and stored in a database or in a data file. These files are used for input into statistical analysis software.
30
Databases and aggregate data
The data structure of an aggregate file is based on tabulations organized by one or more of these factors: time, geography, or social characteristics.
31
Time series aggregate data
Time series: each line of the data file represents tabulations for a specific period of time. For example, a file of annual statistics from 1976 to 2005 would have 30 lines, one line for each year.
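The line count in that example is easy to check. This Python sketch writes a time series file with one line per year; the column names and the `time_series_rows` function are hypothetical, and the value column is left empty rather than filled with invented figures.

```python
import csv
import io


def time_series_rows(first_year, last_year):
    """Yield one row per year; the value column is an empty placeholder."""
    for year in range(first_year, last_year + 1):
        yield {"year": year, "value": ""}


rows = list(time_series_rows(1976, 2005))

# Write the rows to an in-memory CSV file, one line per period.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["year", "value"])
writer.writeheader()
writer.writerows(rows)

data_lines = buffer.getvalue().splitlines()[1:]  # skip the header line
print(len(data_lines))  # 30
```

Both end years are included, which is why 1976 to 2005 gives 30 lines and not 29.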
32
Geo-referenced aggregate data
Geo-referenced data: each line of the data file represents a spatial unit within which summary statistics have been tabulated. The spatial unit with which each line of data is associated is identified through a geo-code. Using Beyond 20/20, Census basic tabulations and profile series can be output for use with GIS software. Correspondingly, Census boundary files are available through DLI that use codes from the Standard Geographic Classification system. A Postal Code Conversion File (PCCF) exists to locate postal codes within Census geography.
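The geo-code is what lets a line of statistics find its place on a map. The Python sketch below shows the idea as a simple lookup: aggregate rows are matched to boundary records that carry the same code. The counts, the "99" code and the `join_on_geocode` function are invented; the two province codes follow the Standard Geographical Classification. Real boundary files hold geometry and are joined in GIS software rather than in a dictionary.

```python
# Hypothetical aggregate data: one line per spatial unit (counts are invented).
aggregate_rows = [
    {"geo_code": "12", "smokers": 150},
    {"geo_code": "48", "smokers": 420},
    {"geo_code": "99", "smokers": 7},  # a code with no matching boundary
]

# Hypothetical stand-in for a boundary file, indexed by geo-code.
boundaries = {
    "12": {"name": "Nova Scotia"},
    "48": {"name": "Alberta"},
}


def join_on_geocode(rows, boundary_index):
    """Attach boundary attributes to each row; report rows that cannot be placed."""
    matched, unmatched = [], []
    for row in rows:
        area = boundary_index.get(row["geo_code"])
        if area is None:
            unmatched.append(row)
        else:
            matched.append({**row, **area})
    return matched, unmatched


matched, unmatched = join_on_geocode(aggregate_rows, boundaries)
print(matched)
print(unmatched)
```

Checking the unmatched rows is worth the trouble. A code that does not match usually means the statistics and the boundaries come from different versions of the classification.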
33
An Example from E-STAT
34
Geo-referenced aggregate data
“Small area statistics” are a special category of aggregate data. These data files consist of statistics for small geographic areas usually calculated from a population or manufacturing census or an administrative database with enough cases to create accurate summaries for small areas.
35
Cross-classified aggregate data
Aggregate data, where each line in the file represents characteristics of the unit of observation, are also known as “cross-classified” tables. These data are often analyzed in the absence of a public use microdata product. For example, no public use microdata exist for vital statistics. Consequently, the cross-classified data for age and sex by cause of death are an important data source for researchers.
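A cross-classified file can still be analyzed, because counts can be summed over any dimension. The Python sketch below assumes a file with one line per combination of age group, sex and cause of death. The categories, the counts and the `collapse` function are invented for illustration.

```python
from collections import defaultdict

# Hypothetical cross-classified file: one line per combination of categories.
cross_classified = [
    {"age": "under 65", "sex": "Female", "cause": "Cause A", "deaths": 4},
    {"age": "under 65", "sex": "Male", "cause": "Cause A", "deaths": 6},
    {"age": "65 and over", "sex": "Female", "cause": "Cause A", "deaths": 10},
    {"age": "65 and over", "sex": "Male", "cause": "Cause B", "deaths": 5},
]


def collapse(rows, dimensions, measure="deaths"):
    """Sum the measure over every dimension not listed in `dimensions`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_sex = collapse(cross_classified, ["sex"])
by_cause = collapse(cross_classified, ["cause"])

print(by_sex)
print(by_cause)
```

What cannot be done is to go the other way. Any characteristic that was not used to build the cross-classification is gone, which is the main difference from working with microdata.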
36
Cross-classified aggregate data
37
Microdata
These are raw data organized in a file in which each line represents one member of a specific unit of observation and the information on each line consists of the values of variables. There are different types of microdata files: master files, share files, public use files and synthetic files.
38
Confidential microdata
Master files: these files contain the full detail captured about each case of the unit of observation. This detail is specific enough that the identity of a case can often be disclosed easily. Therefore, these files are treated as confidential. Master files from the social data in Statistics Canada are available to the research community through the Research Data Centre Network.
39
Confidential microdata
Share files: these are confidential files in which the participants in the survey have signed a consent form permitting Statistics Canada to allow access to their information for approved research. These files consist of a subset of the cases in the master file. Access to share files may be granted to specific government departments without the need for their researchers to work within a Research Data Centre.
40
Public use microdata
These microdata are specially prepared to minimize the possibility of disclosing or identifying any of the individuals in the file. The original data from the master file are edited to create a public use microdata file. Public use microdata files are only available for select social surveys that undergo review by the Data Release Committee, an internal Statistics Canada committee. There are no ‘enterprise’ public use microdata files.
41
Public use microdata
Steps in anonymizing microdata:
Remove all personal identification information (names, addresses, etc.);
Include only gross levels of geography;
Collapse detailed information into a smaller number of general categories;
Cap the upper range of values of variables with rare cases;
Suppress the values of a variable; or
Suppress entire cases.
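The first four of these steps can be sketched in a few lines of Python. Everything here is hypothetical: the field names, the age bands, the income cap and the `anonymize` function are invented and do not describe how Statistics Canada actually prepares a public use file. Suppressing variables or whole cases depends on a disclosure review and is not shown.

```python
# Hypothetical settings; none are taken from a Statistics Canada file.
DIRECT_IDENTIFIERS = ("name", "address")
FINE_GEOGRAPHY = ("city",)
INCOME_CAP = 150_000


def age_group(age):
    """Collapse single years of age into a few general categories."""
    if age < 25:
        return "under 25"
    if age < 65:
        return "25 to 64"
    return "65 and over"


def anonymize(record):
    """Return a public version of one master-file record."""
    # Remove personal identifiers and keep only gross geography (province).
    public = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key not in FINE_GEOGRAPHY
    }
    # Collapse detailed information into general categories.
    public["age_group"] = age_group(public.pop("age"))
    # Cap the upper range of a variable with rare, identifying values.
    public["income"] = min(public["income"], INCOME_CAP)
    return public


master_record = {
    "name": "Respondent 1",
    "address": "1 Example Street",
    "city": "Example City",
    "province": "Alberta",
    "age": 71,
    "income": 400_000,
}

public_record = anonymize(master_record)
print(public_record)
```

Each step trades detail for protection. The public record can no longer say that this person is 71 or earns 400,000, only that they are 65 or over and at or above the cap.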
42
Public use microdata
Almost all public use microdata files are derived using cross-sectional samples, that is, samples where the data have been collected from respondents at one point in time. Longitudinal samples, where data are collected from the same individuals two or more times, are difficult to anonymize while still retaining useful information.
43
Synthetic microdata In an attempt to provide the research community with a version of the microdata that is like the master file but does not contain real cases, some author divisions are exploring the use of synthetic microdata files. Theoretically, these files return results close to the real data in the master file without the risk of disclosure. Synthetic files are different from “dummy” files which have no data but rather have only the variable structure to allow the testing of syntax.
44
Spatial data
Statistics Canada provides spatial data files for each of the different geographic levels at which it disseminates Census results. These files are available as digital boundary files or cartographic boundary files. Digital Boundary Files depict the full extent of the geographical areas and extend into bodies of water. Cartographic Boundary Files depict the geographical areas using only the major land mass of Canada and its coastal islands. These files are only available on the DLI FTP site.
45
Continuum of access It is one thing to know about the variety of Statistics Canada products that exists, but access to this information is a separate issue. The following model describes the various dissemination channels through which Statistics Canada provides access.
46
Continuum of access
Think of the variety of channels as constituting a continuum along which levels of access are provided. There are three characteristics that make up this continuum:
Cost: which runs from free to expensive;
Restrictions or conditions: which run from open or no restrictions to very restricted; and
Type of information: which runs from statistics to data.
You can think about access as flowing along a continuum. Three properties align the channels of access on this continuum. First, there is the cost of the information, which runs from free to expensive. Next, there are the conditions or restrictions upon the use of the information. The greater the restrictions, the less the access. And finally, this combination of cost and restrictions aligns itself with our distinction between statistics and data. Statistics tend to be free and open; confidential data tend to be highly restricted, and access to them is usually very expensive.
47
Continuum of access
48
STC continuum of access
The diagram places the access channels along a continuum that runs from open, free statistics to restricted, expensive data. Channels shown: Statistics Canada Website; Research Data Centres; Data Liberation Initiative; Depository Service Program; Remote Job Submission; Custom Tabulations.
49
Levels of data service There are several models for organizing local services to support DLI materials. Thinking of these models in terms of levels of service is helpful in identifying a model appropriate to your institution’s resources and priorities.
50
Levels of data service
Retrieve data upon request and pass directly to patron. May or may not catalogue DLI titles.
Subscribe to a data extraction service and offer as part of your electronic resources. May or may not catalogue titles.
Integrate into access services, include DLI in electronic resources, your catalogue and your website.
Add reference services to help patrons find data.
Add data consulting services (help with manipulating and formatting data) to both access and reference services.