NCAA Basketball Tournament: Predicting Performance

Slides:



Advertisements
Similar presentations
NCAA Basketball Tournament: Predicting Performance Doug Fenton, Ben Nastou, Jon Potter Mathematics 70; Spring 2001.
Advertisements

Tests of Significance for Regression & Correlation b* will equal the population parameter of the slope rather thanbecause beta has another meaning with.
10-3 Inferences.
Baseball Statistics By Krishna Hajari Faraz Hyder William Walker.
Inference for Regression
Inference for Regression 1Section 13.3, Page 284.
© 2010 Pearson Prentice Hall. All rights reserved The Complete Randomized Block Design.
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 9-1 Introduction to Statistics Chapter 10 Estimation and Hypothesis.
Analysis of Variance & Multivariate Analysis of Variance
The Best Investment in this Economy (Safer than the S&P) Ana Burcroff Kathleen Fregeau Brett Koons Alistair Meadows March 3 rd 2009 – Team 7.
Hypothesis Testing II The Two-Sample Case.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 10-1 Chapter 2c Two-Sample Tests.
Why Model? Make predictions or forecasts where we don’t have data.
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 1): Two-tail Tests & Confidence Intervals Fall, 2008.
+ Chapter 12: More About Regression Section 12.1 Inference for Linear Regression.
Statistical Hypotheses & Hypothesis Testing. Statistical Hypotheses There are two types of statistical hypotheses. Null Hypothesis The null hypothesis,
Chapter 16 Data Analysis: Testing for Associations.
Diagnostics – Part II Using statistical tests to check to see if the assumptions we made about the model are realistic.
Slide Slide 1 Section 8-4 Testing a Claim About a Mean:  Known.
Analyzing Statistical Inferences July 30, Inferential Statistics? When? When you infer from a sample to a population Generalize sample results to.
AP Statistics. Chap 13-1 Chapter 13 Estimation and Hypothesis Testing for Two Population Parameters.
Hypothesis Testing. Suppose we believe the average systolic blood pressure of healthy adults is normally distributed with mean μ = 120 and variance σ.
CHAPTER 15: THE NUTS AND BOLTS OF USING STATISTICS.
Virtual University of Pakistan
Outline Sampling Measurement Descriptive Statistics:
Multiple Regression Analysis: Inference
Predicting Energy Consumption in Buildings using Multiple Linear Regression Introduction Linear regression is used to model energy consumption in buildings.
The statistics behind the game
Data Analysis Module: One Way Analysis of Variance (ANOVA)
23. Inference for regression
Copyright © 2009 Pearson Education, Inc.
Why Model? Make predictions or forecasts where we don’t have data.
Logic of Hypothesis Testing
GS/PPAL Section N Research Methods and Information Systems
Chapter 14 Introduction to Multiple Regression
QM222 Class 9 Section A1 Coefficient statistics
Parameter Estimation.
Paired Samples and Blocks
ECO 173 Chapter 10: Introduction to Estimation Lecture 5a
Practical Statistics Abbreviated Summary.
Non-Parametric Tests 12/1.
Modify—use bio. IB book  IB Biology Topic 1: Statistical Analysis
Bivariate Testing (ANOVA)
Estimation & Hypothesis Testing for Two Population Parameters
The Skinny on High School
Inferences for Regression
Inference for Regression
Chapter 11: Simple Linear Regression
Math 4030 – 10a Tests for Population Mean(s)
Comparing Three or More Means
Chapters 20, 21 Hypothesis Testing-- Determining if a Result is Different from Expected.
Do Age, BMI, and History of Smoking play a role?
Analysis of Covariance (ANCOVA)
ECO 173 Chapter 10: Introduction to Estimation Lecture 5a
Chapter 5 Hypothesis Testing
Module 26: Confidence Intervals and Hypothesis Tests for Variances for Two Samples This module discusses confidence intervals and hypothesis tests for.
Bivariate Testing (ANOVA)
Chapter 10 Two-Sample Tests and One-Way ANOVA.
Inferential Statistics
The statistics behind the game
CHAPTER 29: Multiple Regression*
Multiple Regression Chapter 14.
Goodness of Fit.
Multiple Regression – Split Sample Validation
Chapter 26 Comparing Counts Copyright © 2009 Pearson Education, Inc.
Inferences for Regression
Chapter 26 Comparing Counts.
Chapter 10 – Part II Analysis of Variance
Exercise 1 (a): producing individual tables, using the cross-tabs menu
STATISTICS INFORMED DECISIONS USING DATA
Presentation transcript:

NCAA Basketball Tournament: Predicting Performance Doug Fenton, Ben Nastou, Jon Potter

Project Goals To examine some of the factors that indicate how well a team will do in the NCAA Men’s B-Ball Tournament To compare factors (such as seed, conference, and individual team) and their effect on a team’s result To create an effective model for how well a team does based on certain factors

Background The NCAA Basketball Tournament is a 64-team, single elimination tournament. This has been the tournament’s format since 1985: we use data from 1985-2000. There are four separate regions, each with sixteen teams seeded 1-16 (with 1 being the best and 16 being the worst). The result (dependent) variable is based on how many games a team wins in the tournament.

General Analysis Result =.599 - 0.151*seed + 2.32*percent The first two (independent) variables looked at were the team’s seed and its winning percentage. The regression was as follows: Result =.599 - 0.151*seed + 2.32*percent (R^2=.387) (1.87) (-18.43) (6.00) As can be seen from this data, both seed and winning percentage had a large effect on the team’s result, with result being positively related to percent and negatively related to seed.

A Quick Look at the Seeds...

#1 Seed vs. #2 Seed We can therefore conclude with 95% certainty Histogram for #1 Seed Histogram for #2 Seed Assuming normality in the distribution of results for #1 Seeds and #2 Seeds, T n+m-2 = (Avg(x)-Avg(y))/(SP*(sqrt(1/n+1/m)) = 3.46 T n+m-2 = 3.46 > 1.64 = T 0.05, 126 We can therefore conclude with 95% certainty that One Seeds outperform Two Seeds.

Differences in Variance Do #1 Seeds have a different variance from #2 Seeds? H0: S12 = S22 vs. H1: S12 ≠ S22 F = S12 / S22 = 2.49/2.22 = 1.12 Critical Value: F(63, 63) with 95% confidence: .600<F<1.67 Therefore, the F-stat falls within the interval and #1 Seeds may have the same variance as the #2 Seeds.

Seed Analysis With other seeds, the result data cannot be assumed to be normal. Therefore, hypothesis testing comparing seeds could not be used However, we were able to test if the probability of a given seed winning the championship was different than 1/16 (If all seeds were created equal).

Does the Top Seed Win the Championship More Often? H0 : p1 = 1/16 vs. H1 : p1 > 1/16 t = (9 - 16*(1/16)) / ((16*1/16(1-1/16))^.5) t = 8.26 > t(.05,15) = 1.74 Therefore, we reject the null hypothesis, which means that the top seed wins the championship more often than if the tournament was randomly seeded.

Do the High Seeds Typically Outperform the Lower Seeds | Hi Seeds (1-8) | Lo Seeds (9-16) | Total Lo Result(0-2) | 392 | 504 | 896 Hi Result (3-6) | 120 | 8 | 128 Total | 512 | 512 | 1024 % Hi Result | .234 | .0152 H0: pH = pL vs. H1: pH > pL Phat = (120+8)/(512+512) = .125 Z = (see Thm. 9.4.1) = 10.58 > 1.64 = Z.05 Hence, as expected, higher seeds outperform lower seeds.

The Conference Variables The teams from our study came from 31 different conferences. These conferences were divided into 4 different tiers based past tournament performance and the number of schools who get into the tournament each year (Tier 1 being strongest conferences; Tier 4 being weakest conferences) We then tested how a team’s conference tier was correlated with their performance.

Comparing Teams’ Conferences We tested the correlation between a team’s conference tier and their performance in the tournament. Likewise, we tested to see if there was significance of a team’s winning percentage given their conference tier. Therefore, we created a dummy variable for each tier and interaction terms between tier and winning percentage.

Results (R^2 = .40) result | Coef. Std. Err. t P>|t| -------------+-------------------------------------------------------- win % | 1.219 .574 2.12 0.034 Tier 1 | -4.40 .543 -8.10 0.000 Tier 2 | -2.148 .824 -2.61 0.009 Tier 3 | -4.189 1.00 -4.18 0.000 %*T1 | 7.90 .756 10.44 0.000 %*T2 | 3.93 1.136 3.46 0.000 %*T3 | 6.20 1.34 4.61 0.000

Is Tournament Fairly Seeded Based on Conference Tier? To see if this is true, we looked at only the top 4 seeds because they seemed the most normal. For each of these seeds, we created four groups, one for each tier; to see if performance was consistent with the conference tier given a team’s seed. ANOVA was used for analysis of: H0: MT1 = MT2 = MT3 = MT4 (for each seed 1-4)

Results -------------------------------------------------- Seed Group | F. | F-critical -------------------------------------------------- Seed 1 | 1.102 | 3.148 Seed 2 | 0.365 | 2.758 Seed 3 | 0.934 | 3.148 Seed 4 | 0.039 | 2.758

Analyzing Certain Teams Dummy Variables were created for teams which had been in at least 12 (75%) of the tournaments. There are not enough data points, and the histograms are too skewed, to assume normality for the team data

A Quick Look at the Teams’ Performances

Is Duke the Best? Duke vs. Kentucky Assuming normality in the distribution of results for both Kentucky and Duke (which may not be a valid assumption), T n+m-2 = (Avg(x)-Avg(y))/(SP*(sqrt(1/n+1/m)) = 0.356 T n+m-2 = 0.356 < 0.856 = T 0.20, 26 Therefore, we cannot reject the null hypothesis that Duke and Kentucky have perform equally well with even 20% certainty

Time Trends? According to this time trend regression, Kentucky would have overtaken Duke in 1994.

Maybe So...

Are Certain Teams Mis-Seeded? If the team’s dummy variable is significant with seed, it suggests that that team is often “mis-seeded” (ie. a team is consistently seeded higher or lower than it should be).

Under-Rated? So, for example, Duke can be expected to win more than one more game than other teams of the same seed, and Illinois can be expected to win more than half a game less than other teams of the same seed. If Duke and Illinois are seeded the same, Duke can be expected to win almost two full games more than Illinois.

Analyzing Experience An experience variable was created to reflect the total number of previous tournament games (won or lost) a team had played since 1985. Result = .952 + .054*experience - .051*year (R^2=.14) (12.80) (12.87) (-5.51) Hence, there is correlation between experience and result, suggesting that teams which have been in the tournament often typically win more games… also, successful teams typically stay successful.

Regression with Experience (R^2=.39) result | Coef. Std. Err. t P>|t| -------------+-------------------------------------------------------- win % | 2.575 .391 6.59 0.000 seed | -.131 .009 -13.72 0.000 exper | .016 .0042 3.84 0.000 year | -0.016 .0081 -2.07 0.038

Experience (cont…) That experience is significant when regressed with seed and winning percentage indicates that it is not fully accounted for in the seeding of teams, and that it is another variable worth looking at when making tournament predictions. The experience variable is significant in a variety of regressions indicating its robustness as an explanatory variable

FINAL REGRESSION

Conclusions Tournament predictions can be fairly accurate based solely on seed There are other predictors such as winning percentage, conference, and experience which can be used to refine predictions However, better teams don’t always win, so it is impossible to make predictions absolutely