OLI Module 1 to 4: Introduction & Exploratory Data Analysis

Here’s some data… Everyone gets a card; each group gets a corner (hearts, diamonds, spades, clubs). Below are two sets of data. Discuss in your group: What can we say about each data set? What could each represent? What couldn’t each represent?
Data Set #1: 97, 98, 94, 92, 31, 98, 93, 95, 97, 98, 98
Data Set #2: 4.0, 6.8, 7.1, 7.1, 7.2, 7.4, 7.7, 7.8, 12.1
Each group shares out.

Data is/are… Observations that you or someone else records
But data is/must be more than just numbers; it is numbers in context ... the story behind the numbers... That’s where statistics come in... According to textbook author Dr. Robert Gould, “The goal of statistics is finding meaning in data.”

Who collects data? And why do they collect data?
Think for 1 minute. Share with the person next to you for 1 minute. Share out to entire class (random selection; number off).

Your next assignment!  Listen to the radio or watch TV news. Listen/look for a recent study & findings newscasters are reporting Summarize the study/report Why do you think they did the study/What was the purpose? How do you think they got their data? Typed; printed out (do not submit via ); at least 5-6 complete sentences. Due: next class, at beginning of class.

Here’s an example...According to a study by UCLA ...
...13 million adults, or 46 percent of the state’s population, are believed to be living with the precursor of type 2 diabetes or undiagnosed diabetes. Researchers predict that 33 percent of young adults, aged 18 to 39, are pre-diabetic, which is rare since diabetes is generally more common among older adults. According to researchers, up to 30 percent of residents living with pre-diabetes will develop type 2 diabetes within five years. As many as 70 percent of adults will develop the disease within their lifetime. Why do you think UCLA is interested in this information? How do you think UCLA got these numbers, percents, predictions?

It starts with a topic, followed by a question...
Let’s think of a topic... Now, what are some questions about that topic? Write on board... and/or share out Dr. Gould, UCLA

Number off; consider 1 or 2 variables & come up with a ‘good’ question regarding that 1 or 2 variables... Dr. Gould, UCLA

COC Math 140 Introductory Activity for Modules 7 to 10
See my website, articles assignments activities & more, at the bottom 20 – 30 minutes. Then report out/share, then revise based on comments/feedback Debrief activity

Remember, Context is Key!
Always, always comment, answer, compare, contrast… whatever the case.. in context! How can we find meaning if we don’t have context? Remember what Dr. Gould said, “Goal of statistics is finding meaning in data.” What are the objects? What was measured? What are the units of measure? Think about how we started our class tonight with the 2 data sets I gave you...

Two types of variables…
Look at the eight questions we considered in our activity we just did Could we organize them into two different types of data, two different types of variables? How?

Sometimes it’s in camo…
Can you think of categorical data that looks like numerical data… but isn’t? It’s really categorical. Discuss for a minute… Always ask yourself, does finding the mean (average) of this data make sense?

How many... Come up to board and write the number of different types of social media YOU have used TODAY; write anywhere; no need to organize in any special way. If you are male, please use a blue marker If you are female, please use a black marker

How many... Number of different types of social media YOU have used TODAY; blue: male; black: female One minute to talk to the person next to you about one observation you can make about our data; be prepared to share out your observation

How many... Number of different types of social media YOU have used TODAY; blue: male; black: female First, it’s always helpful to ... Second, and probably more importantly, it’s always helpful to ...

Graphical representations...
Talk to the person next to you for 2 minutes. What type of graphical representation would you choose to best represent this data and why (your group doesn’t actually have to create the graphical representation at this time). Be prepared to explain/justify your reasoning/your choice. Share out.

Graphical representations for numeric (or quantitative) data include...
Dot plots
Stem (and leaf) plots
Histograms
Box plots (later...)
(and much later) ... Density curves, scatter plots, least-squares regression lines, Normal probability plots, etc.
Why didn’t I list pie charts or bar graphs?

No matter what... We always want to create a graphical representation; visuals help us process information and identify trends more easily. We always label & scale our graphical representations. We always use technology when available (no need to create graphical representations by hand).

Let’s create some graphical representations using our class data...
Dot plot... What’s good about dot plots? What’s not so good? Histogram... What’s good about histograms? What’s not so good?

Stem (and leaf) plots... What’s good about stem plots? What’s not so good?
Box Plots... In Stat Crunch, but will learn much more about box plots later...
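
As one way to see what a stem plot shows, here is a minimal Python sketch (not the Stat Crunch output) that prints a text stem plot of Data Set #1 from the opening activity. The function name `stem_plot` is made up for this illustration, and the sketch assumes whole-number data with the tens digit as the stem and the ones digit as the leaf.

```python
def stem_plot(values):
    """Return the lines of a simple stem-and-leaf plot (stem = tens, leaf = ones)."""
    stems = {}
    for value in sorted(values):
        stems.setdefault(value // 10, []).append(value % 10)
    lines = []
    # Walk every stem from smallest to largest so gaps show up as empty rows.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Data Set #1 from the opening activity
data_set_1 = [97, 98, 94, 92, 31, 98, 93, 95, 97, 98, 98]
plot_lines = stem_plot(data_set_1)
print("\n".join(plot_lines))
```

The empty rows for stems 4 through 8 make the gap between 31 and the rest of the values easy to spot, which is one of the things a stem plot is good at.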

Let’s practice one more graphical representation... Partner class work
Go to my website, click on COC Math 140 Survey Data spreadsheet. Find the column ‘How much do you weigh (in pounds).’ Copy and paste into a column in Stat Crunch. Create a histogram, a stem plot, a dot plot, or a box plot of this data (your choice). Be sure to label your graphical representation. Put both your names on it. Looking at your graphical representation, what can you say about the distribution/the data? Be prepared to share out one thing you observe in the graph (we will display your graph up on screen so we can all see it as you describe it). We will print it later and turn it in... But we will do something else with it in a few... CIs, hypothesis testing, sampling distributions, etc. What’s likely? Unlikely?

Note on histograms... Frequency vs. Relative Frequency
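
A small Python sketch of the difference, using Data Set #2 from the opening activity: frequency is the count in each bin, and relative frequency is that count divided by the total number of observations. The bin start (4), bin width (2), and the helper name `bin_counts` are choices made only for this illustration.

```python
import math


def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = math.floor((value - start) / width)
        counts[index] += 1
    return counts


# Data Set #2 from the opening activity
data_set_2 = [4.0, 6.8, 7.1, 7.1, 7.2, 7.4, 7.7, 7.8, 12.1]
start, width, n_bins = 4, 2, 5  # illustrative bin choices

frequencies = bin_counts(data_set_2, start, width, n_bins)
relative_frequencies = [count / len(data_set_2) for count in frequencies]

for i, (freq, rel) in enumerate(zip(frequencies, relative_frequencies)):
    low = start + i * width
    print(f"{low} to under {low + width}: frequency = {freq}, relative frequency = {rel:.2f}")
```

Both versions of the histogram have the same shape; only the scale on the vertical axis changes, and the relative frequencies always add up to 1.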

Let’s look at Graphical representation of ‘weights’...
No matter which graphical representation you created with this data set, how did we describe the graphical representations? What types of characteristics did we consider when trying to describe the graph of this data?

SOCS...
S – Shape: Symmetric? Skewed? Uni-modal, bi-modal, tri-modal, multi-modal? Gaps?
O – Outlier(s): Is/are there unusually large or small values that are “away” from the majority of the rest of the data?
C – Center: What is the “typical*” value of the distribution/data?
S – Spread: Typically/on average*, how far apart or close together is the data/distribution?
* Different types of ‘averages’ and ‘typical’. Will discuss further and in detail soon.

Let’s describe our data using SOCS...
Practice: Let’s look at our social media data with a histogram, dot plot, stem plot, or box plot; & describe the distribution using SOCS. What is likely? Unlikely? What type of a statement could you make (based on this data) about ALL COC students regarding social media? Now with the graphical representation you and your partner created from the ‘weights’ data, describe the distribution using SOCS. You have 10 minutes. You will turn this in as an assignment. Be prepared to do a 1-minute share out as I will randomly call on a few pairs to share out.

What types of graphical representations can we use? What can we not use? Why?
Type of first pet ... or favorite social media, favorite app for cell phone, hair color, make of car you drive, marital status, etc.

Graphical representations for categorical (qualitative) data...
Bar (charts) graphs (caution; very different from histograms; why?)

Caution... Bar graphs vs. histograms...
On left is bar graph; on right is histogram Be sure you understand the difference between the two graphical representations

Now back to Graphical representations for categorical (qualitative) data...
Bar (Charts) Graphs Pie Charts BIG IDEA... the same... visualizing data can be helpful in observing trends Can we analyze pie charts or bar graphs with SOCS? Why or why not? Whether categorical or numerical, always good to graph your data

Let’s graph SOME data... Let’s go to the Math 140 data set, and choose a set of categorical data; cut and paste into Stat Crunch; create bar graph and pie chart; make observations; ask questions With a partner, choose a different categorical data set, practice creating a bar graph AND a pie chart using the data; make observations; ask questions; we will share out in 10 minutes.

Deception… watch for it…

What’s wrong?

Deception…

… fixed ... Sort of....

Different bin widths in histograms... Not a good thing –very deceptive

More deception…

Deception… with the data, and the graphical representation…

Class/Group Activity…
Form groups of 3 randomly (how would we like to do this?). Each group will have a measuring tape. The first person stands (preferably in front of a wall) and imagines that she or he is at an ATM getting cash. The second student stands behind the first. The first student tells the second student how far back he or she must stand for the first student to be just barely comfortable, saying for example, “Move back a little, now move forward just a tiny bit,” and so on. When that distance is set, the third student measures the distance between the heel of the first person’s right shoe and the toe of the second person’s right shoe. That will be called the ‘personal distance.’
First, answer the BEFORE THE ACTIVITY questions below (1 paper for the whole group):
1. Do you think men and women will have different personal distances? Why? Will the larger distances be specified by the men or the women?
2. Which group do you think will have distances that are more spread out?
3. What do you think the shape of each of the distributions will be?
For each student in your group, record the gender and personal distance. Write each of these personal distances on the board. Use blue for male and black for female.
Note: Be respectful of other people’s personal space. Do not make physical contact with other students during this activity.

Class/Group Activity…
Input data into StatCrunch 1-2 paragraph write up which answers the question, “Do men and women have different personal distances?” Include graphs (justify your group’s choice of graph) & numerical analysis (SOCS) of data/graphs (from Stat Crunch; cut and paste) All members of group must contribute Maximum points possible: 20 project points. From Robert Gould, Introductory Statistics

Modules 8, 9, & 10

Module 7... Four Corners: Go to your corner based on if your birthday falls in the Winter, Spring, Summer, or Fall; 1 minute In your group, come to a consensus about the three most important topics we learned and list them on the board. 5 minutes.

Module 7, we learned... Appropriate graphical representations (numerical & categorical data) Always graph the data; always. Always embed context. Always. Describing numerical distributions/data sets via SOCS (the basics; we will get more sophisticated with our descriptions soon); do we use SOCS to describe categorical data distributions? Why or why not?

SOCS... Shape, Outlier(s), Center, Spread
We loosely defined ‘center’ and ‘spread’ Now we will be much more specific & detailed ... And remember, always embed context Here we go  ...

Word association time... When I say a word, you immediately write down what you think it means; don’t think, just write. Don’t talk; don’t say anything to anyone. Ready?

Word association time... Average

Patrons in a diner...
The annual salaries of 7 patrons in a diner are listed below. Find the mean and the median using Stat Crunch. Are the mean and the median similar? Would they represent a ‘typical’ or ‘average’ customer’s salary? Should we use the mean or the median in this case? Graph the data (let’s practice a histogram; then a box plot) using Stat Crunch. What shape is the distribution? $45,000 $48,000 $52,000 $40,000 $35,000 $58,000 $46,000

Now, Bill Gates walks into the diner...
Find the mean and the median using Stat Crunch Are the mean and the median similar? Would both or either represent a ‘typical’ or ‘average’ customer’s salary? Should we use the mean or the median in this case? Graph the data (histogram; box plot) using Stat Crunch. What shape is the distribution? $45,000 $48,000 $52,000 $40,000 $35,000 $58,000 $46,000 $3,710,000,000

What’s the moral of this story?
Means are excellent measures of central tendency if the data is (fairly) symmetric However, means are highly influenced by outlier(s) So, if the data has an outlier(s), then a better measure of central tendency is the median, which is not influenced by outliers; this is called ‘resistant’ So, consider the shape of data/distribution, then wisely choose an appropriate measure of central tendency
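
The diner example can also be checked outside Stat Crunch. The following is a minimal Python sketch using the standard `statistics` module and the salaries from the slides; the variable names are made up for this illustration.

```python
from statistics import mean, median

# Annual salaries of the 7 diner patrons (from the slide)
salaries = [45_000, 48_000, 52_000, 40_000, 35_000, 58_000, 46_000]

# The same diner after Bill Gates walks in
with_gates = salaries + [3_710_000_000]

print("Before: mean =", round(mean(salaries)), " median =", median(salaries))
print("After:  mean =", round(mean(with_gates)), " median =", median(with_gates))
```

One extreme value moves the mean from about $46,286 to $463,790,500, while the median only moves from $46,000 to $47,000. That is what it means for the median to be resistant.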

Which measure of central tendency should we use?
Which is larger: mean or median?
Which should we use to describe the ‘typical’ or middle value?

The ‘C’ in SOCS So, when we are analyzing a numerical distribution (like looking at a histogram, stem plot, box plot, etc.), we need to wisely choose which ‘C’ to use... mean or median Generally, if symmetric use mean (or median) as a measure of central tendency; they will be similar in value (or the same) If skewed (left or right) use median as a measure of central tendency; why?

Measures of Spread What is the median of each of the following data sets? What is the mean of each? (4, 4, 5, 6, 6) (5, 5, 5, 5, 5) Are they the same distribution/data set? Another characteristic that is helpful in describing distributions/data sets is the measure of spread (or the typical distance from the center)
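
A quick check of the two data sets above, as a Python sketch with the standard `statistics` module. It assumes the sample standard deviation (dividing by n – 1); the names `set_a` and `set_b` are only for this illustration.

```python
from statistics import mean, median, stdev

set_a = [4, 4, 5, 6, 6]
set_b = [5, 5, 5, 5, 5]

for name, data in (("A", set_a), ("B", set_b)):
    print(f"Set {name}: mean = {mean(data)}, median = {median(data)}, "
          f"standard deviation = {stdev(data)}")
```

Both sets have a mean of 5 and a median of 5, so the center alone cannot tell them apart. The standard deviation (1 for the first set, 0 for the second) is what captures the difference.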

Spread... The second ‘S’ in SOCS
Another characteristic that is helpful in describing distributions/data sets is the measure of spread (or the typical distance from the center) Two measures of spread that we will focus on in this course are the standard deviation & inter-quartile range

Standard Deviation is ...
a typical distance of the observations from their mean
a number that measures how far away the typical observation is from the center of the distribution

Let’s play the standard deviation game...
Your team’s task: Create a data set of four whole numbers (from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) with the lowest standard deviation value possible Input your four numbers (again use numbers from 0 to 10 only) into Stat Crunch, then calculate the standard deviation Change a value or values until you get the lowest possible standard deviation you can. 3 minutes. Go. Now create a data set (again only from 0 to 10) with the largest possible standard deviation.
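
After your team has played the game by hand, the answer can be checked by brute force. This Python sketch tries every set of four whole numbers from 0 to 10 and keeps the sets with the smallest and largest standard deviation. It assumes the sample standard deviation (dividing by n – 1); the variable names are made up for this illustration.

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Every possible set of four whole numbers from 0 to 10 (order does not matter)
candidates = list(combinations_with_replacement(range(11), 4))

smallest = min(candidates, key=stdev)  # first set found with the lowest SD
largest = max(candidates, key=stdev)   # set with the highest SD

print("Lowest SD: ", smallest, "SD =", stdev(smallest))
print("Highest SD:", largest, "SD =", round(stdev(largest), 2))
```

Any four identical numbers give a standard deviation of 0 (the sketch simply reports the first such set it finds). The largest standard deviation comes from pushing the values as far from the mean as possible: two 0s and two 10s.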

Which has the largest SD?

Calculating the standard deviation...

Variance... Another measure of spread
Not used very often; usually, if we use a mean as a measure of central tendency, we use the standard deviation as our measure of spread. Variance is related to standard deviation:
$\text{variance} = (\text{standard deviation})^2$
$\text{standard deviation} = \sqrt{\text{variance}}$
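
To make the relationship concrete, here is a step-by-step Python sketch of the calculation for the data set (4, 4, 5, 6, 6) used earlier. It assumes the sample formulas (dividing by n – 1), which may or may not match the formula shown in class; the variable names are only for this illustration.

```python
import math

data = [4, 4, 5, 6, 6]
n = len(data)

# Step 1: find the mean
data_mean = sum(data) / n

# Step 2: find each observation's deviation from the mean
deviations = [x - data_mean for x in data]

# Step 3: square the deviations and add them up
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by n - 1 to get the (sample) variance
variance = sum_of_squares / (n - 1)

# Step 5: take the square root to get the standard deviation
standard_deviation = math.sqrt(variance)

print("mean =", data_mean)
print("variance =", variance)
print("standard deviation =", standard_deviation)
```

Squaring the standard deviation gives back the variance, and taking the square root of the variance gives back the standard deviation.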

Data gathering time again...
Write the number of siblings you have on the board & enter into Stat Crunch. Numerical analysis (statistical summary in Stat Crunch) and graphical representation. Describe the distribution.

Skewed? Shouldn’t use mean & SD
But we still need to describe the center and the spread of the distribution. Use median and IQR (Inter-quartile Range). Median & IQR are not affected by outlier(s) (resistant). IQR = Q3 – Q1. IQR is the amount of space the middle 50% of the data occupy.
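
The diner salaries from earlier show why the IQR is called resistant. The sketch below is one possible Python version; it assumes the "median of each half" method for quartiles (software packages, possibly including Stat Crunch, may use slightly different rules), and the function name `quartiles` is made up for this illustration.

```python
from statistics import median, stdev


def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + len(data) % 2:]  # skip the middle value when n is odd
    return median(lower), median(data), median(upper)


salaries = [45_000, 48_000, 52_000, 40_000, 35_000, 58_000, 46_000]
with_gates = salaries + [3_710_000_000]

for label, data in (("Before", salaries), ("After", with_gates)):
    q1, med, q3 = quartiles(data)
    print(f"{label}: median = {med}, IQR = {q3 - q1}, SD = {round(stdev(data))}")
```

With these assumptions, adding Bill Gates changes the IQR only from $12,000 to $12,500, while the standard deviation jumps to more than a billion dollars.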

Range of data... Another measure of variability (used with any distribution) is range Range = maximum value – minimum value Range for our data =

Boxplots ...based on 5-number summary

Boxplots...

Modified boxplot – shows outlier(s)

Two modified boxplots...

What are outliers? Boxplots are the only graphical representation where we specifically define an outlier. Potential outliers are values that are more than 1.5 IQRs below Q1 or above Q3. IQR x 1.5; add that product to Q3; any value(s) beyond that point is an outlier to the right. IQR x 1.5; subtract that product from Q1; any value(s) beyond that point is an outlier to the left.
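
The 1.5 x IQR rule can be sketched in Python as follows, using the diner salaries (with Bill Gates) from earlier. As before, the quartiles are found with the "median of each half" method, which is an assumption and may differ slightly from Stat Crunch; the function names are made up for this illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + len(data) % 2:]  # skip the middle value when n is odd
    return median(lower), median(data), median(upper)


def find_outliers(values):
    """Return (lower cutoff, upper cutoff, list of potential outliers)."""
    q1, _, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)      # IQR x 1.5
    lower_cutoff = q1 - step    # subtract the product from Q1
    upper_cutoff = q3 + step    # add the product to Q3
    outliers = [v for v in values if v < lower_cutoff or v > upper_cutoff]
    return lower_cutoff, upper_cutoff, outliers


with_gates = [45_000, 48_000, 52_000, 40_000, 35_000, 58_000, 46_000, 3_710_000_000]
lower_cutoff, upper_cutoff, outliers = find_outliers(with_gates)
print("Cutoffs:", lower_cutoff, "and", upper_cutoff)
print("Potential outliers:", outliers)
```

Here the cutoffs work out to $23,750 and $73,750, so the only potential outlier is the $3,710,000,000 salary.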

Go back to our Siblings data...
Using Stat Crunch, calculate descriptive statistics Let’s calculate (by hand) to see if we have any outliers Q3 – Q1 = IQR IQR x 1.5; add this product to Q3; are there any values in our data set beyond this point to the right? IQR x 1.5; subtract product from Q1; are there any values in our data set beyond this point to the left? Now use Stat Crunch to create a boxplot; are our calculations confirmed with our boxplot?

Be careful with outliers...
Are they really an outlier? Is your data correct? Was it input accurately? COC’s recent 99-year-old graduate Don’t automatically throw out an unusual piece of data; investigate

Be careful... one more thing...

Partner Practice ...

Your turn... In pairs, choose a set of data from the Math 140 spreadsheet that is skewed (to left or right); you probably won’t know if the data is skewed until you copy and paste into Stat Crunch and create a graph Create a box plot; print out; put your names on it Label (on the graph) the 5-number summary (with arrows pointing to each value on the graph) Analyze through SOCS (which measure of central tendency should you use? Which measure of spread should you use?); be sure you show your work to justify that a point/points are outliers Now, using the same data, create a histogram. What characteristics of the data does the histogram show that the box plot does not?

Classifying Summary Statistics...
1. For each of the following sample statistics, classify it as a measure of spread (variability), a measure of center (average), or a measure of position. Then write a sentence describing what the statistic tells us.
a) Mean
b) Standard Deviation
c) Minimum
d) Range
e) Median
f) Quartile 3 (Q3)
g) Interquartile Range (IQR)
h) Maximum
i) Quartile 1 (Q1)
j) Mode
k) Variance
2. Which measure of center is the most accurate for bell shaped (normal) data sets? Which is the most accurate for skewed data sets?
3. Which measure of spread is the most accurate for bell shaped (normal) data sets? Which is the most accurate for skewed data sets?
4. List all the measures of position.
5. Use StatCrunch and the Bear data to find all of the summary statistics we discussed for the bears’ weights. You need to give the name of the statistic, the number, and the units.

Let’s talk about Exam #1... Will cover Module 1 through Module 10
Topic review sheet on my website
