
Presentation transcript:

Analysis of World Cup Finals

Outline
Project Understanding
– World Cup History
Data Understanding
– How to collect the data
Data Manipulation
– Data Cleaning
– Feature Selection
– Missing Values Handling
– Discretization and Normalization
Data Visualization
Modelling
– Classification of matches
– Regression of matches’ scores
Association Rule Learning
Conclusion

Project Understanding
The domain is football.
The World Cup Finals are the peak of the football competitions held by FIFA.
Why are the World Cup Finals so important?
– Viewer’s perspective
– Player’s perspective
Nationalism

World Cup History

Most Successful countries

Project Understanding
Purpose of the project
– Estimate match results based on matches from previous World Cup Finals
– Estimate match scores based on the scores of matches from previous World Cup Finals
– Find meaningful rules
– Figure out which attributes matter most for winning a match

Data Understanding
No ready-made data is available on the Internet!
No available data is organized for analysis.
The very best data I could find

Data Understanding
The data is far from being enough!
New attributes introduced: Population, Average Income, Host Information, PastSuccess, CurrentForm, FIFA rank, CurrentClubForm, Match Status
Information gathered from FIFA, UEFA, CONMEBOL, CAF, AFC
The information is not enough and needs revision.

Population
A larger population may indicate better national team performance.
A larger population means a larger talent pool to choose from.

Average Income
How does average income affect national team performance?
Is football the sport of the poor or of the rich?

Host
Does hosting affect a country’s performance? In fact, it does.

Past Success
Reflects the World Cup achievements of a country up to the specified year.
It brings a “BIG TEAM” identity.
Points Calculation Table

Past Success World Cup History Table

Current Form
Reflects the achievement in the last World Cup Finals and in the other biggest associated competition (e.g. European Cup, Copa America, African Cup, Asian Cup).
Points are calculated similarly to past success, but each competition has a different weight.
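
A weighted points scheme of this kind can be sketched as follows. The calculation tables themselves are not reproduced in this transcript, so every weight, stage name and point value below is a hypothetical placeholder, as is the function name `current_form`.

```python
# Hypothetical weights and points: the real calculation tables are not
# reproduced in this transcript, so these numbers are placeholders only.
COMPETITION_WEIGHTS = {"world_cup": 1.0, "continental_cup": 0.5}
STAGE_POINTS = {"winner": 10, "runner_up": 7, "semi_final": 5, "group_stage": 1}


def current_form(results):
    """Sum weight * stage points over a team's most recent tournaments.

    `results` maps a competition name to the stage the team reached.
    """
    return sum(
        COMPETITION_WEIGHTS[competition] * STAGE_POINTS[stage]
        for competition, stage in results.items()
    )


score = current_form({"world_cup": "semi_final", "continental_cup": "winner"})
```

Keeping the weights in one dictionary makes it easy to try different weightings per competition without touching the scoring logic.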

Current Form Calculation Table

Current Form European Cup Table

Club Form
Reflects the form of the clubs in a particular country.
The top 25 clubs according to FIFA are identified.
Then, for each club, points are assigned to the corresponding country.
Points are based on Champions League, Copa Libertadores, UEFA Cup and league success.

Club Form Calculation Table

FIFA Rank
Reflects the success of each country over the last five years, along with its club success, league success and international success in all friendly matches, official qualification matches and finals matches.
The lowest rank number means the most successful country.

Data Manipulation - Cleaning
Inconsistent country names: USA vs United States
Countries that no longer exist
– Soviet Union
– Yugoslavia
Missing values
– Before 1991, FIFA ranks and club forms are missing
– Thus, only data after the 1994 World Cup Final is used
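
The name clean-up can be sketched as a small lookup step. This is a minimal illustration, not the procedure used in the project: the alias table and the helper names `canonical_name` and `is_defunct` are assumptions.

```python
# Hypothetical alias table: maps spelling variants to one canonical name.
ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

# Countries that no longer exist, as named on the slide.
DEFUNCT = {"Soviet Union", "Yugoslavia"}


def canonical_name(raw: str) -> str:
    """Return one canonical spelling for a country name."""
    key = raw.strip().lower()
    return ALIASES.get(key, raw.strip())


def is_defunct(name: str) -> bool:
    """Flag teams that need a manual decision (drop, or map to a successor)."""
    return canonical_name(name) in DEFUNCT


# Made-up match pairs, used only to show the effect of the clean-up.
matches = [("USA", "Brazil"), ("United States", "Yugoslavia")]
cleaned = [(canonical_name(a), canonical_name(b)) for a, b in matches]
```

After this step both spellings of the United States collapse into a single team, so its matches are no longer split across two names.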

Feature Selection
The number of attributes is low.
No algorithm is used.
Selection is done using expert knowledge and some statistical tools.

Population vs Success

GDP vs Success

Feature Selection
Population and GDP are removed.
13 attributes are left.

Missing Values Handling
Two different tables are prepared.
One with no missing-value handling operation
– Simply remove rows with missing values
The other using the average for missing values
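
The two tables can be illustrated with a short sketch. The column names and numbers are made up for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Toy rows; None marks a missing value. Names and numbers are illustrative only.
rows = [
    {"fifa_rank": 3, "club_form": 40.0},
    {"fifa_rank": None, "club_form": 10.0},
    {"fifa_rank": 9, "club_form": None},
]


def drop_incomplete(rows):
    """Table 1: keep only rows in which every attribute is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Table 2: replace each missing value with the mean of its column."""
    columns = rows[0].keys()
    means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]


table_dropped = drop_incomplete(rows)
table_imputed = impute_mean(rows)
```

Dropping rows keeps only observed values but shrinks the table; mean imputation keeps every match at the cost of introducing estimated values.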

Discretization - Normalization
Discretization is done for the decision tree and Bayesian classifiers.
Normalization is done for the SVM, neural network and k-NN classifiers.
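
The slides do not say which discretization or normalization method was used. As a minimal sketch, assuming equal-width binning and min-max scaling, the two transformations look like this (function names and sample values are illustrative):

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


fifa_rank = [1, 5, 20, 41]  # illustrative values only
normalized = min_max_normalize(fifa_rank)
binned = equal_width_bins(fifa_rank, n_bins=4)
```

Normalization matters for distance-based methods such as k-NN, where an attribute with a large numeric range would otherwise dominate the distance.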

Data Visualization Correlation Matrix

Data Visualization Box Plot

Data Visualization Scatter Plot: Host vs Result

Data Visualization Scatter Plot: FIFA Rank1 vs Result

Modelling
Each classifier is tested with the following settings
– 5-fold cross-validation
– 10-fold cross-validation
– Random sampling
– Stratified sampling
Discretization and normalization are also done before classification.
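
The project ran these settings in KNIME. As a tool-independent sketch of what stratified k-fold cross-validation does, the following splits row indices so that each fold keeps roughly the same class proportions. The class labels and function names are placeholders.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split row indices into k folds that keep class proportions roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Placeholder class labels for 20 matches.
labels = ["H"] * 10 + ["D"] * 5 + ["A"] * 5
folds = stratified_folds(labels, k=5)
```

With random sampling a small fold can end up with very few examples of a rare class; stratification avoids that, which is one plausible reason it tends to score better on small data sets.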

KNIME - Modelling

Modelling Results of Modelling

Modelling
The decision tree gives the best result!
k-NN gives the second best, while the others classify poorly.
Stratified sampling is generally better than random sampling.
10-fold is generally better than 5-fold.

Decision Tree

The tree model indicates the following order of attribute importance:
1- Club Form
2- Current Form
3- Past Success
4- Host Information

Regression Tree
For regression, a regression tree is used in WEKA.
– The mean absolute error for score1 is 0.43
– The mean absolute error for score2 is 0.38
The errors are high, as expected.
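
For reference, the mean absolute error is the average of |actual - predicted| over all matches. A minimal sketch with made-up goal counts (not the project's data):

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all matches."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up goals and predictions, used only to show the calculation.
actual_goals = [2, 0, 1, 3]
predicted_goals = [1.5, 0.5, 1.0, 2.0]
mae = mean_absolute_error(actual_goals, predicted_goals)
```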

Association Rules
Rules are extracted using WEKA and KNIME after discretization of the data.
– Current Form_2='(-inf-88]' Result=H 439 ==> Club Form_2='(-inf-15.5]' 439 conf:(1) (WEKA)
– FIFA Rank_1='(-inf-20.5]' Result=H 374 ==> Club Form_2='(-inf-15.5]' 371 conf:(0.99) (WEKA)
– Current Form_1='(26-inf)' Result=H 359 ==> Club Form_2='(-inf-15.5]' 354 conf:(0.99) (WEKA)
– , , ,"0-10_Past Success_2","<---","[H_Result]" (KNIME)
– , , ,"0_Club Form_2","<---","[H_Result]" (KNIME)
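
The confidence of a rule X ==> Y is the number of rows where both X and Y hold divided by the number of rows where X holds. The sketch below recomputes the confidences from the two counts printed with each WEKA rule; the shortened rule labels and the function name are only for illustration.

```python
def confidence(antecedent_count, rule_count):
    """conf(X ==> Y) = count(X and Y) / count(X)."""
    return rule_count / antecedent_count


# (count of the left-hand side, count where the right-hand side also holds),
# taken from the counts printed with the WEKA rules.
weka_rules = {
    "Current Form_2, Result=H ==> Club Form_2": (439, 439),
    "FIFA Rank_1, Result=H ==> Club Form_2": (374, 371),
    "Current Form_1, Result=H ==> Club Form_2": (359, 354),
}

rounded = {
    rule: round(confidence(n_x, n_xy), 2)
    for rule, (n_x, n_xy) in weka_rules.items()
}
```

Rounded to two decimals, the results agree with the reported confidence values of 1, 0.99 and 0.99.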

Conclusion
It is hard to collect data manually!
The attributes I came up with are better than FIFA rank, so be careful, FIFA.
To sum up, club form is the most important factor for a country to be successful in the World Cup Finals.
After that come current form, past success and host information.

Golden Generation

Future Work
The more players from the same club a country’s national team has, the more likely that country is to be successful in the World Cup Finals.
Another data analysis could be done on that issue.

THANK YOU FOR LISTENING. ANY QUESTIONS?