Introduction to Biostatistics


1 Introduction to Biostatistics

2 Before we start
Final SMME I exam:
Entry test in Bioethics; Entry test in Biostats.
1 case for statistical analysis and interpretation; 1 bioethical case for comment and discussion; 1 theory question from the bioethics questionnaire.

3 Outline
Population vs sample; Descriptive vs inferential statistics; Sampling methods; Sample size calculation; Levels of measurement; Graphical summaries.

4 Definition of biostatistics
The science of collecting, organizing, analyzing, interpreting and presenting data for the purpose of more effective decisions in a clinical context. “Turning data into knowledge” (Patrick Heagerty)

5 Why do we need to use statistical methods?
To draw the strongest possible conclusions from limited amounts of data; To generalize from a particular set of data to a more general conclusion. What do we need to pay attention to? Bias; Probability. Statistics means never having to say you are certain!

6 Population vs Sample
Population: parameters μ, σ, σ².
Sample: statistics x̄, s, s².

7 Population vs Sample
A population includes all objects of interest, whereas a sample is only a portion of the population: Parameters are associated with populations and statistics with samples; Parameters are usually denoted using Greek letters (μ, σ), while statistics are usually denoted using Roman letters (X, s).
There are several reasons why we do not work with populations: They are usually large, and it is often impossible to get data for every object we are studying; The more items surveyed, the larger the cost.

8 Descriptive vs Inferential statistics
[Diagram: Population (parameters) => Sample (statistics): sampling, from population to sample. Sample (statistics) => Population (parameters): inferential statistics, from sample to population.]

9 Descriptive vs Inferential statistics
We compute statistics and use them to estimate parameters. The computation is the first part of the statistical analysis (Descriptive Statistics) and the estimation is the second part (Inferential Statistics). Descriptive Statistics: The procedure used to organize and summarize masses of data. Inferential Statistics: The methods used to find out something about a population, based on a sample.
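The two steps can be sketched in a few lines of Python. This is only an illustration: the blood pressure readings are hypothetical, and the interval uses a normal approximation with z = 1.96 (for a sample this small a t-based interval would usually be preferred).

```python
import math
import statistics

# Hypothetical sample of systolic blood pressure readings (mmHg).
sample = [118, 125, 130, 121, 135, 128, 119, 124, 132, 127]

# Descriptive statistics: organize and summarize the sample itself.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential statistics: use the sample to say something about the population.
# Assumption: normal approximation, z = 1.96 for a 95% confidence interval.
z = 1.96
margin = z * sd / math.sqrt(len(sample))
ci = (mean - margin, mean + margin)

print(f"mean = {mean:.1f}, sd = {sd:.1f}")
print(f"95% CI for the population mean: {ci[0]:.1f} to {ci[1]:.1f}")
```

The first two numbers describe the sample; the interval is a statement about the unknown population mean.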

10 Probability
A measure of the likelihood that a particular event will happen. It is expressed by a value between 0 and 1 (0.0: cannot happen; 1.0: sure to happen). Note that we talk about the probability of an event, but what we measure is the rate in a group. If we observe that 5 babies in every 1000 have congenital heart disease, we say that the probability of a (single) baby being affected is 5 in 1000, or 0.005.

11 Probability vs Statistics
Probability: General => Specific; Population => Sample; Model => Data.
Statistics: Specific => General; Sample => Population; Data => Model.

12 Sampling Individuals in the population vary from one another with respect to an outcome of interest:

13 Sampling
When a sample is drawn there is no certainty that it will be representative of the population. [Figure labels: Sample A, Sample B]

14 Sampling Sampling – a specific principle used to select members of population to be included in the study. Random error can be conceptualized as sampling variability. Bias (systematic error) is a difference between an observed value and the true value due to all causes other than sampling variability. Accuracy is a general term denoting the absence of error of all kinds.
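A small simulation may make the difference between random error and bias more concrete. It is a sketch under stated assumptions: the population is simulated, and the "biased" rule (only units above the median can be selected) is a made-up example of a flawed selection procedure.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values of some outcome.
population = [rng.gauss(100, 15) for _ in range(10_000)]
true_mean = statistics.mean(population)

# A flawed selection rule: only units above the median can enter the sample.
cutoff = statistics.median(population)
above_median = [x for x in population if x > cutoff]

def sample_mean(units, n):
    """Mean of a simple random sample of size n drawn from `units`."""
    return statistics.mean(rng.sample(units, n))

random_means = [sample_mean(population, 50) for _ in range(200)]
biased_means = [sample_mean(above_median, 50) for _ in range(200)]

# Random error: estimates scatter from sample to sample but centre on the truth.
random_shift = statistics.mean(random_means) - true_mean
spread = statistics.stdev(random_means)
# Bias: estimates are shifted in one direction, no matter how often we repeat.
biased_shift = statistics.mean(biased_means) - true_mean

print(f"random sampling: average shift {random_shift:+.2f}, spread {spread:.2f}")
print(f"biased sampling: average shift {biased_shift:+.2f}")
```

Repeating the study averages out sampling variability, but it does nothing to remove the systematic shift.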

15 Sampling [Figure labels: Population, Sample A, Sample B]

16 Sampling [Figure labels: Population, Sample A, Sample B]

17 Sampling
Stages of sampling: Defining the target population; Selecting a sampling method; Determining the sample size.
Properties of a good sample: Random selection; Representativeness by structure; Representativeness by number of cases.

18 Sampling
Non-probability sampling: Judgment (purposive) sampling; Convenience sampling; Snowball sampling; Quota (proportional) sampling.
Probability sampling: Simple random sampling; Systematic sampling; Stratified sampling; Cluster sampling.

19 Advantages of probability sampling
Provides a quantitative measure of the extent of variation due to random effects; Provides data of known quality; Provides acceptable data at minimum cost; Better control over non-sampling sources of error; Mathematical statistics and probability can be applied to analyze and interpret the data.

20 Disadvantages of non-probability sampling
Bias unknown; Selection bias very likely; No mathematical property; Provides false economy.

21 Non-probability sampling
Judgment: Sample group members are selected on the basis of the researcher's judgment. Time efficiency; Samples are not highly representative; Unscientific approach; Personal bias.
Convenience: Obtaining participants conveniently with no requirements whatsoever. High levels of simplicity and ease; Usefulness in pilot studies; Highest level of sampling error; Selection bias.

22 Non-probability sampling
Snowball: Sample group members nominate additional members to participate in the study. Possibility to recruit hidden populations; Over-representation of a particular network; Reluctance of sample group members to nominate additional members.

23 Probability sampling
Simple random sampling: Each element has an equal probability of being selected from a list of all population units (a sample of n from a population of N). Highly effective if all subjects participate in data collection; High level of sampling error when the sample size is small.
Systematic sampling: Every kth member of the population (a fixed sampling interval k) is included in the study. Time efficient; Cost efficient; High sampling bias if periodicity exists.

24 Simple random vs systematic sampling
Systematic sampling has many advantages: Provides a better random distribution than simple random sampling; Simple to implement; May be started without a complete listing frame (e.g., interview of every 9th patient coming to a clinic); With ordered list, the variance may be smaller than in simple random sampling. However: In systematic sampling, only the first unit is selected at random, the rest being selected according to a predetermined pattern; Systematic sampling is to be applied only if the given population is logically homogeneous; Simple random sampling is free of classification error and requires minimum advance knowledge of the population.
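The two selection rules can be sketched as follows. This is a minimal illustration, assuming a complete list of units is available and that the sample size n divides conveniently into the list; the function names and the list of patient IDs are hypothetical.

```python
import random

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of selection (without replacement)."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Random start, then every k-th unit, where k = N // n."""
    k = len(units) // n        # sampling interval
    start = rng.randrange(k)   # only this first unit is chosen at random
    return [units[start + i * k] for i in range(n)]

rng = random.Random(1)            # fixed seed so the run is repeatable
patients = list(range(1, 101))    # hypothetical list of 100 patient IDs

srs = simple_random_sample(patients, 10, rng)
sys_sample = systematic_sample(patients, 10, rng)

print("simple random:", sorted(srs))
print("systematic:   ", sys_sample)
```

In the systematic sample the gaps between selected IDs are all equal to k, which is why a periodic pattern in the list can bias the result.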

25 Stratified sampling The population is divided into smaller groups or strata to complete the sampling process. Strata are based on some common characteristics in the population data.

26 Stratified sampling
Effective representation of all subgroups; Precise estimates in cases of homogeneity or heterogeneity within strata; Knowledge of strata membership is required; Complex to apply in practice.
Types of stratified sampling:
Proportionate allocation uses a sampling fraction in each of the strata that is proportional to that of the total population. For instance, if the population consists of X total individuals, m of which are male and f female (where m + f = X), then the relative sizes of the two samples (x1 = m/X males, x2 = f/X females) should reflect this proportion.
Optimum allocation (or disproportionate allocation) uses a sampling fraction in each of the strata that is proportional both to the stratum's share of the total population (as in proportionate allocation) and to the standard deviation of the variable within the stratum. Larger samples are taken in the strata with the greatest variability to generate the least possible overall sampling variance.
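The two allocation rules can be sketched as below. The stratum sizes and standard deviations are hypothetical, and the optimum rule is written in the form often called Neyman allocation (weights equal to stratum size times stratum standard deviation), which is one common reading of the description above.

```python
def proportionate_allocation(n, sizes):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(sizes.values())
    return {name: n * size / total for name, size in sizes.items()}

def optimum_allocation(n, sizes, sds):
    """Split n in proportion to (stratum size x stratum standard deviation)."""
    weights = {name: sizes[name] * sds[name] for name in sizes}
    total = sum(weights.values())
    return {name: n * w / total for name, w in weights.items()}

# Hypothetical population: 600 males and 400 females.
sizes = {"male": 600, "female": 400}
# Hypothetical within-stratum standard deviations of the outcome.
sds = {"male": 10.0, "female": 20.0}

prop = proportionate_allocation(100, sizes)
opt = optimum_allocation(100, sizes, sds)

print("proportionate:", prop)
print("optimum:      ", {k: round(v, 1) for k, v in opt.items()})
```

The results are fractional and would be rounded in practice. With these numbers the more variable stratum receives the larger share under optimum allocation, even though it is the smaller stratum.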

27 Cluster sampling Used when mutually homogeneous yet internally heterogeneous groups are evident in the population. The population is divided into these groups and a simple random sample of groups is selected. The elements in each cluster are then sampled.

28 Cluster sampling
Time and cost efficient; Group-level information needs to be known; Usually higher sampling errors compared to alternative sampling methods.
Types of cluster sampling: If all elements in each sampled cluster are sampled, then this is referred to as a "one-stage" cluster sampling plan. If a simple random subsample of elements is selected within each of these groups, this is referred to as a "two-stage" cluster sampling plan.
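A one-stage and a two-stage plan can be sketched as follows. The clinics, patient labels and function names are hypothetical; the only point is where the random selection happens.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, rng):
    """Randomly pick clusters, then take every element in each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then a simple random subsample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 5 clinics (clusters) with 20 patients each.
clinics = {f"clinic_{c}": [f"c{c}_p{p}" for p in range(20)] for c in range(5)}

one_stage = one_stage_cluster_sample(clinics, 2, rng)
two_stage = two_stage_cluster_sample(clinics, 2, 5, rng)

for name, members in two_stage.items():
    print(name, members)
```

In both plans the clusters, not the individual patients, are the units drawn at the first stage.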

29 Stratified vs cluster sampling
The main difference between cluster sampling and stratified sampling is that in cluster sampling the cluster is treated as the sampling unit, so sampling is done on a population of clusters (at least in the first stage). In stratified sampling, the sampling is done on elements within each stratum. In stratified sampling, a random sample is drawn from each of the strata, whereas in cluster sampling only the selected clusters are sampled. A common motivation of cluster sampling is to reduce costs by increasing sampling efficiency. This contrasts with stratified sampling, where the motivation is to increase precision.

30 Sample size calculation
Law of Large Numbers: As the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero. Application in biostatistics: Bigger sample size, smaller margin of error. A properly designed study will include a justification for the number of experimental units being examined. Sample size calculations are necessary to design experiments that are large enough to produce useful information and small enough to be practical.
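The Law of Large Numbers can be illustrated with simulated coin tosses. This is only a sketch: the fair coin (expected proportion 0.5) and the trial counts are chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
p_true = 0.5            # expected proportion of heads for a fair coin

def observed_proportion(n_trials):
    """Proportion of heads in n_trials simulated fair coin tosses."""
    heads = sum(rng.random() < p_true for _ in range(n_trials))
    return heads / n_trials

for n in (10, 100, 1_000, 10_000, 100_000):
    p_hat = observed_proportion(n)
    print(f"n = {n:>6}: observed = {p_hat:.4f}, "
          f"difference from expected = {abs(p_hat - p_true):.4f}")
```

The difference tends to shrink as n grows, although not necessarily at every single step, since each run is still subject to chance.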

31 Sample size calculation
Provides validity of clinical trials and intervention studies; Assures that the intended study will have the desired power for correctly detecting a (clinically meaningful) difference in the entity under study, if such a difference truly exists.
Two objectives: Measure with precision (precision analysis); Assure that the difference is correctly detected (power analysis).

32 Sample size calculation
Generally, the sample size for any study depends on: Acceptable level of confidence; Expected effect size and absolute error of precision; Underlying scatter in the population; Power of the study.
High power: Large sample size; Large effect; Little scatter.
Low power: Small sample size; Small effect; Lots of scatter.

33 Sample size calculation
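For quantitative variables: $n = \frac{Z^2 \cdot SD^2}{d^2}$, where Z is the standard normal value for the chosen confidence level (1.96 for 95%); SD is the standard deviation; d is the absolute error of precision (margin of error).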

34 Sample size calculation
Sources of variance information: Published studies (concerns: geographical, contextual, time issues – external validity) Previous studies Pilot studies Sample size estimation depends on the study design – as variance of an estimate depends on the study design.

35 Sample size calculation
For quantitative variables: A researcher is interested in knowing the average systolic blood pressure in the pediatric age group at a 95% level of confidence and a precision of 5 mmHg. Standard deviation, based on previous studies, is 25 mmHg. With Z = 1.96: $n = \frac{1.96^2 \cdot 25^2}{5^2} = 96.04$, rounded up => 97
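The calculation can be checked with a short function. It assumes the usual precision-based formula n = Z² · SD² / d², which reproduces the result of 97 given in the example; the function name is hypothetical.

```python
import math

def sample_size_mean(z, sd, d):
    """n = Z^2 * SD^2 / d^2, rounded up to the next whole subject."""
    return math.ceil((z ** 2) * (sd ** 2) / (d ** 2))

# Example from the slide: 95% confidence (Z = 1.96), SD = 25 mmHg, d = 5 mmHg.
n = sample_size_mean(z=1.96, sd=25, d=5)
print(n)  # 97
```

Rounding is always upwards, because rounding down would give slightly less precision than requested. Halving d to 2.5 mmHg roughly quadruples the required sample size.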

36 Sample size calculation
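For qualitative variables: $n = \frac{Z^2 \cdot p(1 - p)}{d^2}$, where Z is the standard normal value for the chosen confidence level (1.96 for 95%); p is the expected proportion in the population; d is the absolute error of precision (margin of error).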

37 Sample size calculation
For qualitative variables: A researcher is interested in knowing the proportion of diabetes patients having hypertension. According to a previous study, the actual proportion is no more than 15%. The researcher wants to calculate the sample size with a 5% absolute precision error and a 95% confidence level. With Z = 1.96, p = 0.15 and d = 0.05: $n = \frac{1.96^2 \cdot 0.15 \cdot 0.85}{0.05^2} = 195.92$, rounded up => 196
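The same check for a proportion, assuming the usual formula n = Z² · p(1 − p) / d², which reproduces the result of 196 given in the example; the function name is hypothetical.

```python
import math

def sample_size_proportion(z, p, d):
    """n = Z^2 * p * (1 - p) / d^2, rounded up to the next whole subject."""
    return math.ceil((z ** 2) * p * (1 - p) / (d ** 2))

# Example from the slide: 95% confidence (Z = 1.96), p = 0.15, d = 0.05.
n = sample_size_proportion(z=1.96, p=0.15, d=0.05)
print(n)  # 196
```

When no prior estimate of p is available, p = 0.5 is a common conservative choice, because p(1 − p) is largest there.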

38 When do you need biostatistics?
BEFORE you start your study! After that, it will be too late…

39 Planning
Research program: Aim; Object; Units of observation; Indices of observation; Place; Time; Statistical analyses; Methodology.

40 One vs Many Many measurements on one subject are not the same thing as one measurement on many subjects. With many measurements on one subject, you get to know the one subject quite well but you learn nothing about how the response varies across subjects. With one measurement on many subjects, you learn less about each individual, but you get a good sense of how the response varies across subjects.

41 Paired vs Unpaired Data are paired (related), when two or more measurements are made on the same observational unit (subjects, couples, and so on). Data are unpaired (unrelated), where only one type of measurement is made on each unit.

42 Data processing
Data check and correction; Data coding; Data aggregation.
According to the data usage: Primary; Secondary. According to the number of indices: Simple; Complex.
It is always a good idea to summarize your data (at least for important variables): You become familiar with the data and the characteristics of the sample that you are studying; You can also identify problems with data collection or errors in the data (data management issues); Range checks for illogical values.

43 Variables vs Data
A variable is something whose value can vary. Data are the values you get when you measure a variable.
Example table (subjects: Mr. Smith, Mrs. Johns, Mrs. Oliver). Age: 36, 43, 56. Sex: Male, Female. Blood type: A.

44 Quantitative (metric) variables
Continuous (measured units): Metric continuous variables can be properly measured and have units of measurement. Continuous values on a proper numeric line or scale. Data are real numbers (located on the number line).
Discrete (counted units): Metric discrete variables can be properly counted and have units of measurement: ‘numbers of things’. Integer values on a proper numeric line or scale.

45 Qualitative (categorical) variables
Nominal: Values in arbitrary categories. Ordering of the categories is completely arbitrary; in other words, categories cannot be ordered in any meaningful way. No units! Data do not have any units of measurement.
Ordinal: Values in ordered categories. Ordering of the categories is not arbitrary; it is possible to order the categories in a meaningful way.

46 Levels of measurement There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go from lowest level to highest level. Data is classified according to the highest level which it fits. Each additional level adds something the previous level didn't have. Nominal is the lowest level. Only names are meaningful here. Ordinal adds an order to the names. Interval adds meaningful differences. Ratio adds a zero so that ratios are meaningful.

47 Levels of measurement
Nominal scale, e.g., genotype: You can code it with numbers, but the order is arbitrary and any calculations would be meaningless.
Ordinal scale, e.g., pain score from 1 to 10: The order matters but not the difference between values.
Interval scale, e.g., temperature in °C: The difference between two values is meaningful.
Ratio scale, e.g., height: It has a clear definition of 0. When the variable equals 0, there is none of that variable. When working with ratio variables, but not interval variables, you can look at the ratio of two measurements.

48 Variables
Different types of data require different kinds of analyses:
Frequency distribution: yes for nominal, ordinal, interval and ratio variables.
Median, percentiles: no for nominal; yes for ordinal, interval and ratio variables.
Mean, standard deviation: no for nominal and ordinal; yes for interval and ratio variables.
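The rule that each level of measurement supports its own summaries plus those of the levels below it can be written as a small lookup. This is a sketch; the dictionary and function names are hypothetical, and the list of summaries is limited to the ones mentioned in these slides.

```python
# Levels of measurement, from lowest to highest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each summary becomes meaningful.
MIN_LEVEL = {
    "frequency distribution": "nominal",
    "median, percentiles": "ordinal",
    "mean, standard deviation": "interval",
    "ratio of two values": "ratio",
}

def is_meaningful(summary, level):
    """True if `summary` is meaningful for a variable measured at `level`."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[summary])

for summary in MIN_LEVEL:
    allowed = [lvl for lvl in LEVELS if is_meaningful(summary, lvl)]
    print(f"{summary}: {', '.join(allowed)}")
```

For example, a pain score from 1 to 10 is ordinal, so a median is meaningful but a mean is not, strictly speaking.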

49 Data processing
Some visual ways to summarize data: Tables; Graphs (bar charts, histograms, box plots).

50 Frequency table
Elements: Formal (title, main column, main row, legend); Logical.

51 Frequency table
Simple table, with its formal elements labelled on the slide (title, main row, main column, legend):
Table 1. Anti-HBs (+) outcomes per group from an HBV screening study*
Screened group | Number of Anti-HBs (+) cases | %
Children of 7 y. | 3 | 10%
Children of 11 y. | 7 | 23%
Children of 17 y. | |
Romani people | 1 | 3%
Contacts in family | |
Health professionals | 13 | 43%
Total | 30 | 100%
* Part of TPTBHB Project

52 Frequency table
Complex table (cross tabulation), with risk group in the rows and residence in the columns:
Table 2. HBV high-risk groups to be screened by residence*
Risk group | Smolyan | Zlatograd | Rudozem | Subtotal
Contacts in family | 65 | 20 | 15 | 100
Health professionals | 98 | 30 | 22 | 150
Romani people | | | |
Total: | | | | 350
* Part of TPTBHB Project
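A cross tabulation like this is built by counting records for each combination of the two variables. The sketch below uses a handful of hypothetical records, not the project data, only to show how the cells, subtotals and grand total arise.

```python
from collections import Counter

# Hypothetical raw records: (risk group, residence) for each person.
records = [
    ("Contacts in family", "Smolyan"),
    ("Contacts in family", "Zlatograd"),
    ("Health professionals", "Smolyan"),
    ("Health professionals", "Smolyan"),
    ("Health professionals", "Rudozem"),
    ("Romani people", "Zlatograd"),
]

cell_counts = Counter(records)                        # one count per table cell
row_totals = Counter(group for group, _ in records)   # subtotal per risk group
col_totals = Counter(place for _, place in records)   # total per residence

places = sorted(col_totals)
print("Risk group |", " | ".join(places), "| Subtotal")
for group in sorted(row_totals):
    cells = [str(cell_counts[(group, place)]) for place in places]
    print(group, "|", " | ".join(cells), "|", row_totals[group])
print("Total:", sum(row_totals.values()))
```

Row subtotals and column totals must both add up to the same grand total, which is a quick consistency check on any cross tabulation.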

53 Graphical summaries
Variable | Graph | Statistics
One qualitative | Bar chart; Pie chart | Frequency table; Relative frequency table; Proportion
Two qualitative | Side-by-side bar chart; Segmented bar chart | Two-way table; Difference in proportions
One quantitative | Dotplot; Histogram; Boxplot | Measures of central tendency; Measures of spread; Other: five number summary, percentiles, distribution shape
One quantitative by one qualitative | Side-by-side boxplots; Stacked dotplots | Statistics broken down by group; Difference in means
Two quantitative | Scatterplot | Correlation

54 Bar chart Bar chart is a way to visually represent qualitative data.
Data are displayed either horizontally or vertically, allowing viewers to compare items such as amounts, characteristics, and frequencies. Bars can be arranged in order of frequency, so that more important categories are emphasized. Bar charts can be either single, stacked, or grouped.

55 Pie chart
A pie chart is helpful when graphing qualitative data, where the information describes a trait or attribute and is not numerical. Each slice of the pie represents a different category, with some slices usually noticeably larger than others.

56 Histogram
A histogram is used with quantitative data. Ranges of values, called classes, are listed at the bottom, and the classes with greater frequencies have taller bars. A histogram often looks similar to a bar chart, but they are different because of the level of measurement of the data: A bar chart is for categorical data, and the x-axis has no numeric scale; A histogram is for quantitative data, and the x-axis is numeric.
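How values are grouped into classes can be sketched as below. The ages, the class width of 10 years and the function name are hypothetical; the bars are drawn as text only to keep the example self-contained.

```python
def histogram_counts(values, low, width, n_classes):
    """Count values in each class [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        i = int((v - low) // width)
        if 0 <= i < n_classes:   # values outside the classes are ignored
            counts[i] += 1
    return counts

ages = [23, 27, 31, 35, 36, 41, 44, 45, 52, 58]  # hypothetical ages in years
counts = histogram_counts(ages, low=20, width=10, n_classes=4)

# Text version of the histogram: one row per class, one '#' per observation.
for i, c in enumerate(counts):
    start = 20 + i * 10
    print(f"{start}-{start + 9}: {'#' * c}")
```

Because the classes sit on a numeric axis, changing the class width changes the shape of the picture, which is not the case for the categories of a bar chart.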

57 Boxplot Boxplot is a method for graphically depicting groups of numerical data through their quartiles.
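The numbers behind a boxplot are the five-number summary: minimum, first quartile, median, third quartile and maximum. A minimal sketch with hypothetical data follows; it uses the "inclusive" quartile method of Python's `statistics.quantiles`, and other quartile conventions can give slightly different values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [2, 4, 4, 5, 7, 8, 9, 11, 12]  # hypothetical measurements
summary = five_number_summary(data)

labels = ("min", "Q1", "median", "Q3", "max")
for label, value in zip(labels, summary):
    print(f"{label}: {value}")
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend towards the minimum and maximum.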

