Presentation on theme: "The Model Following these assumptions, I propose a hierarchical model with these characteristics: where is the number of goals scored by a team’s offense."— Presentation transcript:
The Model Following these assumptions, I propose a hierarchical model with these characteristics: where is the number of goals scored by a team’s offense or allowed by a team’s defense, represent the strength of offense and defense for each individual team, is the total number of games played by each team, is the shape parameter, and is the scale parameter. Also, in this model it is assumed that any observed score can be split into offensive ‘scores’, and defensive ‘allowed scores’. In other words, For simplicity’s sake, in this model I assume and I use a data-set in which the number of games for each team is equal (pool play). The expanded forms of this model that relax these assumptions are explored further in my paper. What is a Hierarchical Model, and why use it? A hierarchical model is a model that is useful when a distributions parameters are assumed to have been drawn from a common underlying distribution: in this case, we assume that different teams offense and defense are defined by a common underlying Gamma distribution. The parameters that define this underlying distribution are known as the hyperparameters for the model. In this model’s case, the hyperparameters are and. Having a team’s strength come from an underlying distribution also makes sense, since all teams tend to draw from the same pool of players. While strength may vary between teams, there should be an underlying average skill level on which most players and teams fall. A Gamma distribution is chosen for two reasons: first, it is conjugate with the Poisson scoring process of soccer, and so makes the math involved dramatically easer, and second, the Gamma distribution is easy to justify logically when compared to any alternative distribution, such as a normal curve. Additionally, the hierarchical model is useful because it minimizes the number of parameters involved in any particular distribution, thereby preventing over-fitting to the data. My Assumptions about Soccer: Assumption 1: Offense and Defense are independent from each other. In any particular game of soccer, two teams attempt to score goals on the opposing team’s goal while defending their own goal from attack. Thus, we can roughly break down a team into an “offense” and “defense”, which act separately from each other. Assumption 2: Individual goals are independent of previous goals. In addition, for simplicity’s sake, we assume that any particular goal is independent from a previous goal; that is, the chance of scoring a goal when the score is 0-0 is independent of the chance of scoring a goal when the scores is 2-0. Assumption 3: Goal scoring follows a Poisson process. This assumption follows from our previous assumption. We assume that due to soccer’s low scores (i.e. low-probability of scoring) and due to the independence of goals from previous goals, we can model a team’s scores using a Poisson distribution. Assumption 4: Each team’s ability is not determined in a uniform fashion; rather, the team offense and defense capabilities are themselves drawn from an underlying distribution. This assumption is fundamental for our use of a hierarchical model. We must assume that ability throughout a particular soccer league is not uniformly distributed but is drawn instead from a shaped distribution (in this model, a Gamma distribution). Modeling Soccer: A Hierarchical Model Paul Goldsmith-Pinkham Swarthmore College, Department of Mathematics & Statistics About The Data Set: This poster’s data was drawn from results of the 2002 World Cup. The pool play data, in which each team played three games against teams in their pool, was used to train the model. Quarterfinal, semifinal and finals results were used in a qualitative fashion to determine the validity of the results. Generating the Parameters Using MCMC Due to the structure of the model, it is not possible to directly calculate the full-posterior distributions of and. Instead, it is possible to calculate the conditional distributions: Using these conditional distributions, it is possible to use a version of the Markov Chain Monte Carlo called the Gibbs sampler. The Gibbs sampler simply states that given a set of parameters that are related by conditional distributions, it is possible to use alternating conditional sampling to approach the marginal distribution of the parameters, and obtain estimates of the parameters. Verification In order to test the validity of the model, there are several measures that can be used. One way to verify if the cumulative distribution of scores generated using the estimated strength parameters matches up with the cumulative density (CDF) generated by the actual data. The results of this heuristic follow: Clearly, while the model is not perfect, it does a good job of matching up with the data. In fact, this lack of a perfect match is preferred, since the purpose of the hierarchical model is to prevent over-fitting to our data. Another way to check the model’s validity is to utilize the strength parameters to generate outcomes between teams playing in the Round of Sixteen, Quarterfinals, Semifinals, and Finals. Unfortunately, under this heuristic, the model did not perform as well, due in large part to the small amount of training data. Conclusions Using a hierarchical model, I have been able to take a small data-set (World Cup pool play) and estimate strength parameters for the teams. While the model does a good job of modeling the training data to which it is exposed, it does less of a good job predicting the winners in the next four rounds of the World Cup. The model predicts only two out of eight of the winners from the Round of Sixteen, and two out of four of the winners from the Quarterfinals. However, the model successfully predicts both of the winners in the Semifinals, but fails miserably in the Finals. The reason behind the lack of complete success can be linked to a small training set: due to the small amount of data, outliers, such as Germany’s 8-0 beatdown of Saudi Arabia, skew the parameters higher than they should be. With more data, the estimated parameters will settle more. This is pursued further in my paper. CDFY= 0Y <= 1Y <= 2 Y <= 3 Data Predicted Citations -Gelman et al, Bayesian Data Analysis, Chapman & Hall -Maher, M.J., ‘Modeling Association Football Scores’, Statistica Neerlandica, 36: Figure 1: The CDF of the data (black) and generated values (blue) Separate data into offense and defense using Draw and using Acknowledgements I would like to thank Phil Everson, for without him this project would not have been possible.