Collecting Data.


The meaning of statistics
Statistics is a way to answer questions using information. The information has to be observed or collected, ordered, represented and then analysed. The information that is collected is called raw data. The data you collect depends on the question you are trying to answer, and the proposed answer that you set out to test is called a hypothesis. A hypothesis is an assumption, which may or may not be true, that is the starting point of an investigation. For example, suppose you want to answer the question “How does the price of a car change as it gets older?” You could use the hypothesis “As a car gets older, the price goes down.” The data needed to test this would be the age and price of cars.

Types of data
Quantitative variables are numerical observations or measurements. Quantitative variables can be continuous or discrete. Height, weight, response time, subjective rating of pain, temperature, and score on an exam are all examples of quantitative variables. Qualitative variables are non-numerical observations. Also known as categorical variables, qualitative variables are variables with no natural sense of ordering. They are therefore measured on a nominal scale. For example, hair color is a qualitative variable, as are names. Qualitative variables can be coded to appear numeric, for example male = 1 and female = 2, but the numbers themselves are meaningless.

Continuous and discrete data
Continuous data can take any value on a continuous numerical scale. The length of a piece of string could take any value on this scale, so it is continuous data. Discrete data can only take particular values on a numerical scale. The number of eggs laid by a chicken can only take particular values, so it is discrete data.

Summary

Categorical data
To make raw data easier to handle and easier to display, they may be gathered or ordered in a particular way. Categorical data is an example of this. A set of data is categorical if the values or observations belonging to it can be sorted into different categories. Each piece of categorical data is put into one of a set of non-overlapping categories. Shoes can be sorted according to colour: the characteristic "colour" can have non-overlapping categories (e.g. black, brown, blue etc.). Numerical data can also be put into categories. The characteristic "length of thumb" can have non-overlapping categories: x ≤ 30mm, 30mm < x ≤ 40mm, 40mm < x ≤ 50mm, etc.
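The thumb-length classes above can be sketched in Python. This is only an illustration: the function name `thumb_category`, the catch-all class above 50mm and the sample measurements are assumptions, not part of the original slides.

```python
def thumb_category(length_mm: float) -> str:
    """Return the non-overlapping class that a thumb length falls into."""
    if length_mm <= 30:
        return "x <= 30mm"
    if length_mm <= 40:
        return "30mm < x <= 40mm"
    if length_mm <= 50:
        return "40mm < x <= 50mm"
    # Assumed catch-all class; the slide only says "etc." beyond 50mm.
    return "x > 50mm"


# Hypothetical measurements, made up purely for illustration.
lengths = [28.0, 30.0, 30.5, 40.0, 47.2]
categories = [thumb_category(x) for x in lengths]
```

Because every boundary value belongs to exactly one class (30mm goes in the first class, not the second), each measurement lands in one and only one category.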

Ranked data
Ranked data has values or observations that can be ranked (put in order) or have a rating scale attached. Ranked data can be counted and ordered, but not measured. The categories for a ranked set of data have a natural order. Example: if judges ranked ten dogs from 1 to 10, 1 would represent the best of the dogs and 10 the worst.

Bivariate data
Definition: Bivariate data are pairs of related variables. Here are some examples of bivariate data pairs: age and weight of a person; price and age of an object; shoe size and height; weight and fitness level; popularity and cost of an object.

Why Use Bivariate Data Pairs?
We use bivariate data pairs in research and analysis for the following reasons: to determine whether there is a relationship between two variables; to analyse how a change in one variable affects the other, and vice versa; and because bivariate data pairs can sometimes simplify an analysis.

For Example…
Two students were trying to discover whether there was a relationship between shoe size and age in girls. They asked three friends and recorded the results. These showed that, usually, as age (the first variable) increases, so does shoe size (the second variable). To investigate this pattern further they would need to interview more people.

How Accurate Is Continuous Data?
Continuous data is not always 100% accurate just because it is measured in numerical units. This is because continuous data is measured in the kind of units that are usually rounded: age (to a whole number of years), distance (to the nearest m, cm, km, mile etc.) and time (to the nearest minute, hour, day, week etc.).

Rounding Continuous Data
Usually we round to the nearest 10, 100, 1000 and so on, depending on what is being recorded or measured. If the last digit is 5 or above we round up; if it is below 5 we round down. For example, a person's height might be recorded as 151 cm. Unfortunately this means recorded continuous data can be inaccurate by up to half a unit.
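A minimal sketch of this rounding rule in Python, assuming rounding to the nearest whole unit. The helper name `round_half_up` and the heights used are hypothetical. Python's built-in `round` is not used because it rounds exact halves to the nearest even number rather than always up.

```python
import math


def round_half_up(value: float) -> int:
    """Round to the nearest whole unit; an exact half rounds up."""
    return math.floor(value + 0.5)


# Hypothetical true heights in cm that are all recorded as 151 cm.
true_heights = [150.5, 151.0, 151.4]
recorded = [round_half_up(h) for h in true_heights]

# The largest gap between a true value and its recorded value.
largest_error = max(abs(r - h) for r, h in zip(recorded, true_heights))
```

Here `largest_error` comes out at half a unit, which is the most a value can move when it is rounded to the nearest whole unit.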

Accuracy Of age In Continuous Data
Age isn't usually that accurate, because it is normally given as the age a person was on their last birthday, which means it could be out by up to 364 days.

For Example…
If Jenny sits an exam it might ask for her age. If her last birthday was her 15th birthday she would have to write down 15, even if her 16th birthday was the next day. She would be a lot closer to 16 than 15, but age is recorded as the age at the last birthday rather than rounded to the nearest year. This could make the analysis inaccurate, especially if age was one of the variables in a bivariate pair.
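The "age at last birthday" rule can be sketched as follows. The function name and the dates are hypothetical, chosen only to mirror Jenny's situation of sitting an exam the day before a birthday.

```python
from datetime import date


def age_last_birthday(born: date, today: date) -> int:
    """Return the age in whole years reached at the most recent birthday."""
    years = today.year - born.year
    # If this year's birthday has not happened yet, take one year off.
    if (today.month, today.day) < (born.month, born.day):
        years -= 1
    return years


# Hypothetical dates: the exam falls one day before the 16th birthday.
born = date(2008, 6, 15)
exam_day = date(2024, 6, 14)
recorded_age = age_last_birthday(born, exam_day)
```

With these assumed dates the recorded age is 15, even though the person is only one day short of 16.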

Populations and sampling
You must identify the population in an investigation. Population simply means everything or everybody that could be involved in an investigation. Census data contains information about every member of the population. Sample data contains information about part of the population. It is usually used when the population is too large to survey all members, so a small but carefully chosen sample is used to represent the population.
Census. Advantages: unbiased, accurate, takes the whole population into account. Disadvantages: time-consuming, expensive, lots of data to handle.
Sample. Advantages: cheaper, less time-consuming, less data to be considered. Disadvantages: not completely representative, may be biased.

Sampling
Sampling units are the people or items in the population that are to be sampled. A sampling frame is a list of all the sampling units in the population. When taking a sample, there are two questions to be asked: How big does the sample need to be? How is the sample taken so that it represents the population accurately? The sample size can vary; however, the larger the sample, the more reliable the results are.

RANDOM SAMPLING
To represent the population accurately, the sample should be taken randomly so that it is free from bias. Bias can occur in many ways, such as: using an unrepresentative sample; poor or misleading questions; external factors affecting the data collection; not correctly identifying the whole population. A random sample is a sample that is taken without a conscious decision being made about which members of the population are to be selected. Random sampling methods include simple random sampling and stratified sampling.

Simple random sampling
Simple random sampling makes sure that each sample of size n has an equal chance of being selected. There are many ways to take a simple random sample, including: using a random number table; using a random number generator on a calculator; using a computer to choose numbers; putting the numbers in a hat and drawing out as many as are needed for the sample. For example: Q. Select 10 random numbers, each less than 50, starting from the top left-hand corner and working your way down.
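The "use a computer to choose numbers" option might look like the sketch below, which uses Python's standard `random` module. The sampling frame of the whole numbers 1 to 49 and the helper name `simple_random_sample` are assumptions made for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Pick n distinct items so that every sample of size n is equally likely."""
    rng = random.Random(seed)  # a seed only makes the run repeatable
    return rng.sample(frame, n)


# Assumed sampling frame: the whole numbers less than 50.
sampling_frame = list(range(1, 50))
chosen = simple_random_sample(sampling_frame, 10, seed=1)
```

`random.sample` selects without replacement, so no number can be chosen twice, which matches drawing numbers out of a hat.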

Stratified Sampling
Stratum means just one sub-population. Strata is the plural, meaning more than one sub-population. There are often factors that divide the population into sub-populations. These have to be considered when choosing a sample, to ensure that it is representative of the whole population. The size of the sample taken from each group must be in proportion to the relative size of that group. An example question could be: How many should be included from each group? You can check your answer by totalling the answers from all the groups.

Stratified sampling
Population by shift: Shift 1: 125, Shift 2: 100, Shift 3: 85, Shift 4: 140 (total 450).
Example: A sample of 90 is taken. You must divide the population within each shift by the total, which is 450, and then multiply by the sample size, which in this case is 90.
Shift 1: 125/450 x 90 = 25
Shift 2: 100/450 x 90 = 20
Shift 3: 85/450 x 90 = 17
Shift 4: 140/450 x 90 = 28
Check: 25 + 20 + 17 + 28 = 90.
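The shift calculation can be reproduced with a short Python sketch. The function name `stratified_allocation` is hypothetical. The figures are the ones from the example, and rounding to the nearest whole number is an assumption that only matters when the proportions do not divide exactly.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Share a sample between strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    return {
        name: round(size * sample_size / total)
        for name, size in strata_sizes.items()
    }


shifts = {"Shift 1": 125, "Shift 2": 100, "Shift 3": 85, "Shift 4": 140}
allocation = stratified_allocation(shifts, 90)

# Check suggested in the slides: the parts should add up to the sample size.
check_total = sum(allocation.values())
```

In general the rounded parts may not add up exactly to the sample size, which is why totalling the answers is a useful check.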

Non-random sampling Higher Statistics
Cluster sampling: Used when the population being sampled splits naturally into groups or “clusters”. A random sample of clusters is then selected.

Non-random sampling
Quota sampling: A quota of subjects of a specified type is interviewed.
Systematic sampling: From the sampling frame, a starting point is chosen at random, and then items are chosen at regular intervals. Example: 20 students are needed from a total of 100 in the year group. 100/20 = 5, so every fifth student is chosen after a random starting point: 6, 11, 16, 21, 26, 31...
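Systematic sampling as described above can be sketched like this. The function name, the numbering of students from 1 to 100 and the choice of a start within the first interval are assumptions made for illustration.

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Take every k-th item after a random start, where k = len(frame) // sample_size."""
    interval = len(frame) // sample_size
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random position within the first interval
    return frame[start::interval][:sample_size]


# Assumed sampling frame: students numbered 1 to 100.
students = list(range(1, 101))
chosen = systematic_sample(students, 20, seed=3)
```

With 100 students and a sample of 20, the interval is 5, so the chosen student numbers always differ by 5.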

Collecting data
Primary data is data that has been collected by, or for, the person who is going to use it. Secondary data is data that has already been collected by someone else. Examples of primary data include: observations recorded in tally charts (how sunny it is in June); the heights of the pupils in Year 7. Secondary data may be collected from: websites; magazines and newspapers; databases; research articles.

Surveys
A survey is the collection of data from a given population. The data are used to analyse a particular issue. Primary data may be collected this way. The main methods of collecting primary data in a survey are: questionnaires; interviews; observations; experiments; data logging. Pilot surveys: conducted on a small sample to test the design and methods of the survey.

Questionnaires
A questionnaire is a set of questions designed to obtain data. Anyone who answers a questionnaire is called a respondent. Rules for writing a questionnaire: use short, simple questions; use easily understood words and phrases; avoid leading questions; address only a single issue in each question, e.g. "Does your car run on diesel fuel?"; use intervals for answers, e.g. ranges of years; avoid embarrassing questions.

2 types of questions
Open questions: questions that have no suggested answers. The disadvantage is that the different answers have to be summarised before they can be analysed.
Closed questions: questions that have a set of answers for the respondent to choose from. The advantage is that the answers are easier to summarise.

Interviews
The different types of interview are: sending questions to people by post or email; calling people on the phone; face-to-face personal questioning.
Disadvantages: Emails and postal questionnaires are the least likely to get a response. Telephone and personal interviews are expensive, and both can be biased.
Advantages: Postal questionnaires and emails are cheap. The response rate for telephone interviews is good. Personal interviews are good for complex questions.

Investigations and experiments
Experiments can be used to collect data. There are five methods that can be part of a statistical investigation: before-and-after experiments; control groups; matched pairs; data logging; the capture-recapture method.

Investigations and experiments
Control groups: One group is given a drug to see its effect, whereas the other group is not. The control group is the one that is given an inactive substance, while the other group is given the drug. The groups are randomly selected, and the effectiveness of the drug can then be assessed by comparing the groups.
Matched pairs: Two groups are used in which each individual is paired with one who has everything in common with them apart from the factor being tested. Identical twins can be important here.
Data logging: A method of collecting primary data automatically, using electronic or mechanical equipment.
Capture-recapture method: This gives an estimate of the size of a population. The population is of size N, which is the number to be estimated. First, you capture M members of the population, mark them and release them. After waiting some time, so that they have time to mix with the rest of the population, you capture a second sample of n members. The number m of these that are marked is then recorded. The formula for this method is: N = Mn/m

Example
Forty fish in a lake are caught, marked and returned to the lake. A second sample of 100 fish is later caught. Of these 100 fish, 10 are marked. Estimate the number of fish in the lake.
Answer: M = 40, n = 100, m = 10, so N = Mn/m = 40 x 100 / 10 = 400 fish.
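The same estimate can be checked with a small Python sketch of the formula N = Mn/m. The function name `capture_recapture_estimate` is hypothetical, and the guard against m = 0 is an added assumption, since the formula cannot be used if no marked individuals are recaptured.

```python
def capture_recapture_estimate(M: int, n: int, m: int) -> float:
    """Estimate population size N from M marked, n recaptured and m marked recaptures."""
    if m == 0:
        # Assumed guard: with no marked recaptures the formula is undefined.
        raise ValueError("no marked individuals were recaptured")
    return M * n / m


# Figures from the fish example: M = 40, n = 100, m = 10.
estimate = capture_recapture_estimate(40, 100, 10)
```

For the fish example this gives an estimate of 400, matching the worked answer.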

Replication & direct observation
Replication: This means repeating the experiment. If the results are similar each time, they are more reliable.
Direct observation: This means recording, in a systematic manner, the behaviour of the people or items being studied. For instance, counting the number of yellow cars driving down a road within 30 minutes.
