32931 Technology Research Methods Autumn 2017 Quantitative Research Component Topic 1: Descriptive Statistics Lecturer: Mahrita Harahap Mahrita.Harahap@uts.edu.au.


1 32931 Technology Research Methods Autumn 2017 Quantitative Research Component Topic 1: Descriptive Statistics Lecturer: Mahrita Harahap B MathFin (Hons) M Stat (UNSW) PhD (UTS) mahritaharahap.wordpress.com/ teaching-areas Background. 30% Quantitative Assignment due on May 24th Wednesday 9pm Faculty of Engineering and Information Technology

2 Purpose of Quantitative Research
In many disciplines, researchers wishing to publish are asked to provide a rigorous statistical analysis. Reviewers are often specific about what statistical measures they want included. How does one decide which statistical procedure is the most appropriate? How does one correctly interpret the statistical output? This course is designed with an emphasis on understanding the introductory concepts of statistical procedures and on interpreting computer statistical output.
So what is the purpose of quantitative research? In many disciplines, researchers need to provide a rigorous statistical analysis to support their findings. Reviewers sometimes send a manuscript back recommending major or minor revisions to the statistical analysis section, criticising poorly designed experiments or a failure to explain why a specific statistical test or method was chosen. Researchers are also criticised for not determining an appropriate sample size prior to the study, for collecting inappropriate variables, or for choosing biased samples that do not represent the population they are meant to generalise to. If you are doing quantitative research in your study, it is important to improve your quantitative literacy and become familiar with statistical terminology. This is what the next four weeks will be about. The next three lectures are designed with an emphasis on understanding which statistical test to use and how to interpret the computer statistical output. Week 1

3 Statistical Programs
Statistical analyses require specialised software to perform calculations:
R (programming)
SPSS (menu driven)
Minitab (menu driven)
SAS (menu driven)
Python (programming)
Instructions on how to obtain computer printouts will be provided, with an emphasis on interpreting the printout (most packages produce similar printouts). There will be computing lab sessions throughout the course after the lectures. Lab sheets are provided in SPSS and R. The 30% Quantitative Assignment is due on Wednesday May 24th at 5pm.
To do statistical analysis you need specialised software to perform the calculations. SPSS is a menu-driven statistical program. R is a free, open-source statistical programming package. You can choose whether to undertake the labs and the assignment in SPSS or in R. R is a nice skill to learn but may be challenging to pick up in the next four weeks. I have compiled the lab instructions in both SPSS and R, and both should produce similar outputs. The upside of downloading R and RStudio is that you can do the labs and the assignment at home in your own time. The downside of SPSS is that you can only access it on UTS computers, unless you have bought the program for yourself.

4 Steps in Quantitative Research
Problem definition: What do you want to know?
How do you choose your sample?
How do you ask? (questionnaire design)
Data collection
Data analysis
Interpret the results
Research, from hypothesis development through finished manuscript, is a process. Hence, the results section of the manuscript is the product of all of the earlier stages of the research. The better the quality of these earlier stages, the better the quality of the results section. So make sure the problem is well defined before doing any analysis on the data. Researchers sometimes look for a statistician to help them at the data analysis stage and then find out that they asked for help a little too late. It is helpful to involve a statistician earlier in the research process, at the problem definition stage. The famous statistician John Tukey once wrote: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."

5 Misuse of statistics in Research
Scientific Studies: Last Week Tonight with John Oliver
TED Talk on three ways to spot a bad statistic
Statistics are often misused in research. This is partly because people are not statistically literate, and partly for the reasons highlighted earlier. Because people are interested in popular science, the media sensationalises studies and infers causation when there is only a correlation. But of course we cannot just blame the media: sometimes researchers oversimplify their results, use small or biased samples, or alter data to get the significant results they want. Some research is funded by private organisations with a hidden agenda. Not enough replication studies are funded or published. If you have some spare time outside of class, I recommend watching these two videos. These are highly important issues that researchers should take note of.

6 Course Outline Topic 1: Descriptive Statistics
Introduction and Types of Data
Graphical Displays of Data
Measuring Centre and Spread
Univariate Analysis and Bivariate Analysis
Topic 2: Inferential Statistics I - Setting up a Hypothesis Test
How to set up a Hypothesis Test
Type I and Type II Error
Inference on a mean: 1 sample t test
Inference on a proportion: 1 sample proportion test
Topic 3: Inferential Statistics II - Comparison Between Groups
Inference on several means: Independent t test, Paired t test, ANOVA test and Multiple comparison test
Nonparametric Tests: Wilcoxon, Mann-Whitney and Kruskal-Wallis tests
Determining sample sizes
Topic 4: Inferential Statistics III - Bivariate Analysis
Contingency Analysis: Two-way tables, Chi-Square Test and Fisher Exact Test
Simple Linear Regression Analysis: Interpreting coefficients, correlation and the R², F-test, model assumptions and prediction
This is the course outline for the next four weeks. Today we will look at descriptive statistics, and in the next three lectures we will look at inferential statistics. During class time I can only help you with the content listed on this slide. If you have a question about your research that is related to this content, please leave it until the end of the lab, as we have a short amount of time to cover the material. If you have a question about your research that is NOT related to this content, UTS has a statistical consultant, Dr Tapan Rai.

7 Statistics Support at UTS
Statistical Consultation or Collaboration
Graduate Research School Statistics Short Courses (lecturer: Dr Tapan Rai): Design and Analysis of Questionnaires; Introduction to the Design of Experiments; Regression Analysis; Longitudinal Data Analysis
Maths Study Center (CB ): drop-in tutoring, 12 - 5pm weekdays
Statistical courses available in SPRING at UTS: Regression Analysis (uses SPSS); Advanced Statistical Modelling (about GLMs, uses SAS); Multivariate Statistics (about exploratory data analysis, uses R); Design and Analysis of Experiments (in AUTUMN only); Programming for Informatics (uses Python); Statistical Methods (statistics postgraduate course)
Hacky Hour is a weekly meetup where researchers can congregate to work on their research problems related to code, data, or digital tools in a social environment. Bring along your challenges, digital tools or code to share. Have your technical research problems solved or find someone who can help you. Every Thursday 3 - 4pm at Penny Lane.
Statistics is a large field, as you can see here. If you want a statistician on board with your research, you can either hire one as a consultant or work with one as a collaborator and co-author. A collaborator might be harder to find, since statisticians are busy with their own research, but it never hurts to ask. For consultation, my understanding is that the first consultation is free. The GRS short courses are designed for researchers who wish to extend their statistical skills to some advanced topics, and are run in the School twice a year. Some of the spring courses have a prerequisite, so make sure you have completed that subject.

8 Statistics Support Online and Outside UTS
Lynda.com is a vast online library of instructional videos covering the latest in technology skills, taught by accomplished teachers and recognised industry experts. Help on statistical concepts and data analysis in SPSS and R. Log in to Lynda for free through the UTS library website.
Datacamp.com: help on programming skills in R and Python.
KhanAcademy.org: help on understanding statistical concepts in depth via short videos.
mahritaharahap.wordpress.com/teaching-areas: help on introductory statistical concepts.
Statistical Society of Australia (SSAI) seminar meetups. Meet some statisticians.

9 Definitions A population includes all individuals, measurements or objects of interest. A sample is all the cases that we have collected data on (a subset of the population of interest). A parameter is a number that describes some aspect of a population. A statistic is a number that is computed from data in a sample and describes some aspect of that sample. Statistical inference is the process of using data from a sample to gain information about the population. Week 1

10 In statistics we usually want to analyse a population parameter, but collecting data for the whole population is usually impractical or expensive, and the data are often unavailable. That is why we collect samples from the population (sampling) and draw conclusions about the population parameters using the statistics of the sample (inference) with a chosen level of confidence (the complement of the level of significance). The main job of quantitative researchers is therefore to learn about a population parameter from a sample. Ideally, we should only make inferences if our sample is not biased.
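The sampling-then-inference idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of heights is simulated (the mean of 170 cm, standard deviation of 10 cm and the sizes are made up, not taken from the lecture), and it uses only the standard library.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (made-up values).
population = [rng.gauss(170, 10) for _ in range(100_000)]
parameter = statistics.mean(population)  # population parameter (usually unknown)

# Simple random sample of 100 cases, drawn without replacement.
sample = rng.sample(population, 100)
statistic = statistics.mean(sample)  # sample statistic (what we can compute)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In practice only the sample line would be available; the statistic is then used to estimate the parameter, and the gap between the two is what inferential statistics tries to quantify.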

11 Choosing a sample In quantitative research we usually want to find out about the population based on a sample, or subset, of the population. This is only valid if our sample selection is free of bias. Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way. Using biased samples is one of the main misuses of statistics in research. It is good practice, and important, to take a random sample in order to generalise about the population.

12 Sampling bias If sampling bias exists, we cannot trust generalisations from the sample to the population. As an example of a biased sample, say you wanted to know the average height of Australians, and you asked the Australian men's basketball team for their heights as your sample. Why would this be a biased sample? As another example, say you wanted to know who is the preferred candidate for Prime Minister of Australia, so you attend a One Nation party gathering and hold a voluntary poll on who the attendees would prefer. Why would this be a biased sample? Voluntary samples, such as voluntary polls, are biased samples, because typically only people who feel strongly about the topic respond.

13 How can we avoid sampling bias?
Take a random sample. Various methods of random sampling include:
simple random sampling (every member of the population has an equal chance of being chosen)
systematic random sampling (choose the first item randomly, and then at regular intervals)
stratified random sampling (divide the population into strata and then randomly sample within each stratum)
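The three sampling methods listed above can be sketched as follows. This is a minimal illustration, not the lab's method: the sampling frame of 1,000 people, the "student"/"staff" strata and the function names are all hypothetical, and the stratified version assumes proportional allocation.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 1,000 people, each tagged with a stratum.
population = [
    {"id": i, "stratum": "student" if i % 4 == 0 else "staff"}
    for i in range(1000)
]

def simple_random_sample(units, n):
    """Every unit has the same chance of being chosen."""
    return rng.sample(units, n)

def systematic_sample(units, n):
    """Choose a random starting unit, then take every k-th unit."""
    k = len(units) // n
    start = rng.randrange(k)
    return units[start::k][:n]

def stratified_sample(units, n, key):
    """Randomly sample within each stratum, in proportion to its size."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    chosen = []
    for members in strata.values():
        share = round(n * len(members) / len(units))
        chosen.extend(rng.sample(members, share))
    return chosen

srs = simple_random_sample(population, 100)
sys_sample = systematic_sample(population, 100)
strat = stratified_sample(population, 100, "stratum")
```

With proportional allocation, the stratified sample keeps the same student-to-staff mix as the frame (25 and 75 here), which a simple random sample only matches on average.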

14 Other forms of bias Even with a random sample, data can still be biased, (especially when collected on humans) Other forms of bias to watch out for in data collection: Question wording Context Inaccurate responses Many other possibilities – examine the specifics of each study! Always think critically about how the data were collected, and recognize that not all forms of data collection lead to valid inferences. Wording matters and can affect the outcome. In practice, of course, we should try to actually write questions in a way that will NOT bias the results. The first step is to recognise the effect of leading questions. Here is a video explaining the effect and bias from asking leading questions: Week 1

15 Statistical Writing Research, from hypothesis development through finished manuscript, is a process. Hence, the results section of the manuscript is the product of all of the earlier stages of the research. The better the quality of these earlier stages, the better the quality of the results section.
The results section usually contains two parts: the descriptive statistics and the analyses. These two parts should be closely related. The descriptive statistics are important because this is where you introduce the variables to your audience. The analysis part of the results section is where you describe your specific hypothesis, the statistical technique that you used, the model, and how your findings relate to the question you are trying to answer.
When you describe the statistical technique, provide enough detail that your audience can understand what you did and why you did it. The more you understand about a statistical technique, the easier it is to describe it to others.

16 Descriptive and Inferential Statistics
Descriptive Statistics: describe the sample
Inferential Statistics: use the sample to test theories about the population
The purpose of descriptive statistics is to become familiar with the data that you have collected. What information can you get from the data? What are we looking for?
What is a 'typical' response?
How different are responses from different individuals?
Do responses differ between different groups of individuals?

17 Types of data
The way that we look at data will depend on the type of data that we have.
Categorical (divides the cases into groups/categories)
  Nominal (no order) e.g. Nationality, Gender, Month
  Ordinal (order) e.g. Satisfaction level or level of education
Quantitative (measures a numerical quantity for each case)
  Discrete (takes whole number values) e.g. number of birds in a tree, shoe size
  Continuous e.g. height, temperature in Celsius
The types of statistical tests or operations that can be performed on your data depend on the type of data. If you choose the wrong scale of measurement, your ability to test your hypothesis will be compromised. So you need to know at the beginning what you want to show in order to know what type of measure you need.

18 Types of studies An observational study is a study in which the researcher does not actively control the value of any variable but simply observes the values as they naturally exist. Observational studies can almost never be used to establish causation. An experiment is a study in which the researcher actively controls one or more of the explanatory variables, known as treatments. If a randomised experiment yields an association between the two variables, we can establish a causal relationship from the explanatory variable to the response variable. Two variables are associated if values of one variable tend to be related to the values of the other variable. Two variables are causally associated if changing the value of one variable directly influences the value of the other variable. If two variables are associated with each other, it does not mean that one variable directly causes the other. CORRELATION DOES NOT IMPLY CAUSATION!
In summary, there are two types of studies. An experiment is where you assign treatments to certain observations or units, whereas an observational study is where you observe the values as they naturally exist. You generally cannot establish causation from observational studies, but you can from randomised experiments.

19 Types of Graphical displays
Univariate
Categorical data: bar charts to depict frequencies; pie charts to depict proportions
Quantitative data: histogram to look at the distribution; boxplot as a graphical display of the summary statistics
The visualisation options available depend on the sorts of variables that we have.

20 Displaying categorical data
Bar graph: good for comparing the number of individuals in different groups.
Pie graph: good for looking at parts of a whole.

21 Displaying quantitative data
Histogram: good for getting an overall 'picture' of the data. It shows the distribution and the kind of symmetry in the data.
Boxplot: good for finding unusual observations and looking at the symmetry of the data.

22 Symmetry We measure the symmetry of a distribution by calculating the skewness of the distribution. If the skewness is negative, then the distribution is left skewed. If the skewness is positive, then the distribution is right skewed. If the skewness is close to 0, then the distribution is reasonably symmetric. In general, we will only state that a distribution is skewed if the skew is strong and obvious.
[Figure: three example distributions labelled Left-Skewed, Symmetric and Right-Skewed]
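The sign rule for skewness can be checked with a small Python sketch. It assumes the simple moment-based coefficient (third central moment divided by the 1.5 power of the second); statistical packages often report a bias-adjusted version, so their numbers may differ slightly from this one. The three data sets are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets chosen to show each case.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image, tail to the left

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name}: skewness = {skewness(data):.2f}")
```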

23 Presenting two variables Graphically
Bivariate
Categorical x Categorical: multiple bar charts to compare frequencies; crosstabs to depict frequencies, row and column proportions
Categorical x Quantitative: multiple comparison boxplots
Quantitative x Quantitative: scatterplots to show the relationship between two numerical variables

24 Measures of central tendency
Where is the 'middle' of the data? We could also think of it as what we would expect to measure for a typical respondent.
Three different measures: Mean, Median, Mode

25 Measures of central tendency
Mean = add the observations and then divide by the number of observations The mean can be quite sensitive to large values and skewness – avoid using the mean to describe skewed data e.g. Suppose that we have the observations then the mean of these observations will be given by Week 1

26 Measures of central tendency
Median = the middle observation (n odd) or the average of the two middle observations (n even). The median is useful when describing skewed data. It is unaffected by extremely large and extremely small values. Also known as the 50th percentile.
Min = 1, Q1 = 2, Median = 3.5, Q3 = 4, Max = 10

27 Measures of central tendency
Mode = the most frequently occurring observation E.g. the mode of is 4, since 4 appears the most. Week 1
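The three measures of centre can be computed with Python's standard `statistics` module. The observations used on the slides were not preserved in this transcript, so the data set below is hypothetical: it was chosen only because it reproduces the summary values quoted above (minimum 1, median 3.5, maximum 10, mode 4) and should not be read as the lecture's actual data.

```python
import statistics

# Hypothetical observations (an assumption, not the slide's data).
data = [1, 2, 2, 3, 4, 4, 4, 10]

mean = statistics.mean(data)      # sum of observations / number of observations
median = statistics.median(data)  # average of the two middle values (n is even)
mode = statistics.mode(data)      # most frequently occurring observation

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

The mean (3.75) sits above the median (3.5) because the single large value of 10 pulls it upward, which is the sensitivity to large values described for the mean.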

28 Types of data – Measures of central tendency
Categorical
  Nominal (no order) e.g. Nationality, Gender: Mode
  Ordinal (order) e.g. S, M, L or level of education: Median or Mode
Quantitative
  Discrete (takes whole number values) e.g. number of birds in a tree: Mean, Median or Mode
  Continuous e.g. height: Mean or Median

29 Revisiting Symmetry Week 1

30 Measures of dispersion
How spread out are my data? How do my data vary?
Measures: Standard deviation, Range, Interquartile range (IQR), Coefficient of Variation (CV)

31 Measures of dispersion
Standard deviation: on average, how far away each point in the dataset is from the mean,
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} $$
and it is measured in the same units as the data. Why do we divide by n-1?
If our data have a normal (bell-shaped, symmetric) distribution, then we can apply the following rule:
about 68% of our data is within 1 standard deviation of the mean
about 95% of our data is within 2 standard deviations of the mean
about 99.7% of our data is within 3 standard deviations of the mean
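The standard deviation formula can be worked through step by step in Python. The data set is hypothetical (the slide's observations are not preserved in this transcript). On the question of the n-1 divisor, the usual explanation is that the deviations are measured from the sample mean, which is itself estimated from the same data, so dividing by n would tend to understate the population spread; this explanation is a standard result rather than something stated on the slide.

```python
import math
import statistics

# Hypothetical observations (an assumption, not the slide's data).
data = [1, 2, 2, 3, 4, 4, 4, 10]

n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

s = math.sqrt(sum_sq / (n - 1))  # sample standard deviation (n - 1 divisor)
sigma_n = math.sqrt(sum_sq / n)  # same data with an n divisor, for comparison

print(f"s (n-1 divisor) = {s:.4f}")
print(f"n divisor       = {sigma_n:.4f}")
print(f"statistics.stdev agrees: {math.isclose(s, statistics.stdev(data))}")
```

The n-1 version is always slightly larger than the n version, and the difference shrinks as the sample size grows.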

32 Measures of dispersion
The range is the simplest measure of dispersion, given by:
Range = Maximum Observation - Minimum Observation
It is very sensitive to extreme values.
Example: Suppose that the 10 in our data set was actually a 100. Then the range becomes 100 - 1 = 99 instead of the 9 that was previously observed.

33 Measures of dispersion
The interquartile range is the difference between Q3 (75th percentile) and Q1 (25th percentile):
IQR = Q3 - Q1
It is not sensitive to extreme values like the range is, and it can be used for skewed data.
For our example: IQR = 4 - 2 = 2
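The contrast between the range and the IQR can be seen in a short Python sketch. The data are hypothetical, picked to match the quoted quartiles (Q1 = 2, Q3 = 4) and range (9). Quartile definitions vary between packages; this sketch assumes the default method of `statistics.quantiles`, so SPSS or R may report slightly different quartiles on other data.

```python
import statistics

def value_range(values):
    """Range = maximum observation - minimum observation."""
    return max(values) - min(values)

def iqr(values):
    """IQR = Q3 - Q1, using the default quartile method of the statistics module."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical observations (an assumption, not the slide's data).
data = [1, 2, 2, 3, 4, 4, 4, 10]
with_extreme = [1, 2, 2, 3, 4, 4, 4, 100]  # the 10 replaced by 100

print(value_range(data), value_range(with_extreme))  # the range changes a lot
print(iqr(data), iqr(with_extreme))                  # the IQR does not change
```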

34 Example: John Graunt’s Data
The following information comes from John Graunt's observations (1662) on the London bills of mortality. He was unable to collect data for the years 1637 to 1646.
Year Cancer Measles
1629 20 42
1630 14 2
1631 23 3
1632 28 80
1633 27 21
1634 30 33
1635 24
1636 12
1647 26 5
1648 29 92
1649 31
1650 19 33
1651 31
1652 53 62
1653 36 8
1654 37 52
1655 73 11
1656 153
1657 24 15
1658 35 80
1659 43 6
1660 74

35 Revisiting the boxplot
[Figure: annotated boxplot labelling an outlier, Q3, the median, Q1 and the IQR]

36 Measures of dispersion
The coefficient of variation is used when you are comparing the variability between groups with different means. It is defined as the ratio of the standard deviation to the mean, expressed as a percentage. Dividing the standard deviation by the mean standardises the measure of variability so that it is suitable for comparison.
$$ CV = \frac{s}{\bar{x}} \times 100 $$

37 Types of data – Measures of dispersion
Categorical
  Nominal (no order) e.g. Nationality, Gender: None
  Ordinal (order) e.g. S, M, L or level of education: IQR
Quantitative
  Discrete (takes whole number values) e.g. number of birds in a tree: Range, IQR, SD, CV
  Continuous e.g. height: Range, IQR, SD, CV

38 Presenting two variables Graphically
Bivariate
Categorical x Categorical: multiple bar charts to compare frequencies; crosstabs to depict frequencies, row and column proportions
Categorical x Numerical: multiple comparison boxplots
Numerical x Numerical: scatterplots to show the relationship between two numerical variables
Until now we have looked at one variable at a time (called a univariate analysis). Now we will look at two variables at the same time, known as a bivariate analysis. We can look at pairs of variables graphically. The tool that we use depends on the types of variables that we are looking at.

39 Example: Pets and CHD Psychological and social factors can influence the survival of patients with serious diseases. One study, published in Public Health Reports in 1980, examined the relationship between pet ownership and the survival of patients with coronary heart disease (CHD). Each of 92 patients was classified as having a pet or not, and by whether they survived for one year. The data were entered into SPSS, and the printout below gives the table of original values with row and column percentages.

40 Heart disease and pets
Setting out data in SPSS: one row per patient (Patient ID 1, 2, 3, ..., 92) with two columns, Survive and Pet, both Nominal (Categorical). There are two categorical variables in this study.

41 Multiple bar chart for categorical x categorical data
Vertical axis is the frequency of survival. Horizontal axis is whether or not they had a pet.
We can also use cross tabulations to look at categorical x categorical data.

42 Cross tabulation: Heart disease and pets
Pet Ownership * Patient Survival Crosstabulation

                                       Patient Survival
Pet Ownership                          No       Yes      Total
No Pet   Count                         11       28       39
         % within Pet Ownership        28.2%    71.8%    100.0%
         % within Patient Survival     78.6%    35.9%    42.4%
Pet      Count                         3        50       53
         % within Pet Ownership        5.7%     94.3%    100.0%
         % within Patient Survival     21.4%    64.1%    57.6%
Total    Count                         14       78       92
         % within Pet Ownership        15.2%    84.8%    100.0%
         % within Patient Survival     100.0%   100.0%   100.0%

Let's look at one cell and see what the numbers mean.
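To see what the numbers in one cell mean, the row and column percentages can be recomputed from the four counts in the crosstab. The sketch below uses plain Python dictionaries; the variable and function names are illustrative and are not SPSS output.

```python
# Cell counts from the crosstab: pet ownership (rows) by one-year survival (columns).
counts = {
    "No Pet": {"No": 11, "Yes": 28},
    "Pet": {"No": 3, "Yes": 50},
}

row_totals = {pet: sum(cells.values()) for pet, cells in counts.items()}
col_totals = {s: sum(counts[pet][s] for pet in counts) for s in ("No", "Yes")}
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percentage rounded to one decimal place, as in the printout."""
    return round(100 * part / whole, 1)

# "% within Pet Ownership": each cell as a share of its row total.
row_pct = {pet: {s: pct(c, row_totals[pet]) for s, c in cells.items()}
           for pet, cells in counts.items()}

# "% within Patient Survival": each cell as a share of its column total.
col_pct = {pet: {s: pct(c, col_totals[s]) for s, c in cells.items()}
           for pet, cells in counts.items()}

print(row_pct["No Pet"]["No"])  # share of non-owners who did not survive
print(col_pct["No Pet"]["No"])  # share of non-survivors who had no pet
```

For the top-left cell, 11 of the 39 patients without a pet did not survive (28.2% within pet ownership), and those 11 make up 78.6% of the 14 patients who did not survive (within patient survival).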

43 Example: John Graunt’s Data
The following information comes from John Graunt's observations (1662) on the London bills of mortality. He was unable to collect data for the years 1637 to 1646.
Year Cancer Measles
1629 20 42
1630 14 2
1631 23 3
1632 28 80
1633 27 21
1634 30 33
1635 24
1636 12
1647 26 5
1648 29 92
1649 31
1650 19 33
1651 31
1652 53 62
1653 36 8
1654 37 52
1655 73 11
1656 153
1657 24 15
1658 35 80
1659 43 6
1660 74
There is one categorical variable (cause of death: cancer or measles) and one quantitative variable (number of deaths) in this study.

44 Multiple boxplots for categorical x quantitative data
[Figure: side-by-side boxplots of deaths per year for cancer and measles, with points marked "Outlier (> 1.5 x IQR)" and "Outlier (> 3 x IQR)"]
What do the red boxes represent? What about the circles and asterisk? What are the minimum and maximum values? Does the number of deaths per year seem different for measles and cancer?
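The two outlier labels on the boxplot correspond to distance thresholds measured from the edges of the box. A minimal Python sketch of that rule follows; it assumes the distance is measured from Q1 or Q3, uses the default quartile method of `statistics.quantiles` (SPSS computes the box edges slightly differently), and runs on hypothetical data rather than Graunt's counts.

```python
import statistics

def classify_outliers(values):
    """Split points into those beyond 1.5 x IQR and those beyond 3 x IQR from the box."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        distance = max(q1 - x, x - q3)  # how far x lies outside the box (negative if inside)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme

# Hypothetical observations (assumptions, not the slide's data).
data = [1, 2, 2, 3, 4, 4, 4, 10]
with_extreme = [1, 2, 2, 3, 4, 4, 4, 100]

print(classify_outliers(data))          # 10 is more than 1.5 x IQR above Q3
print(classify_outliers(with_extreme))  # 100 is more than 3 x IQR above Q3
```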

45 Scatterplot for quantitative x quantitative data
A scatterplot of weight against height with markers by sex.
Y: Dependent Variable, Response Variable, Variable to be Predicted
X: Independent Variable, Predictor Variable, Explanatory Variable
Can you describe the relationship between height and weight? What effect does sex have? Are there any unusual points? Yes, there is an outlier.

46 Summary
Quantitative variable
  Type of graphs: boxplot, histogram
  Central measure: discrete: mean, median, mode; continuous: mean, median
  Spread measure: IQR, range, SD, CV
Categorical variable
  Type of graphs: bar chart, pie chart, frequency table
  Central measure: nominal: mode; ordinal: median, mode
  Spread measure: nominal: none; ordinal: IQR
Quantitative vs categorical
  Type of graphs: multiple comparison boxplots
Categorical vs categorical
  Type of graphs: multiple bar chart, crosstabs
Quantitative vs quantitative
  Type of graphs: scatterplot

47 Use the discussion boards if you have any questions regarding content
Topic II: Inferential Statistics Mahrita Harahap

