Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data Shawn Cicoria, John.


1 Classification of Titanic Passenger Data and Chances of Surviving the Disaster
Data Mining with Weka and Kaggle Competition Data
Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke
Seidenberg School of Computer Science and Information Systems, Pace University, White Plains, NY, US

2 Background
Titanic Disaster – April 15, 1912
1,502 of the 2,224 passengers and crew perished [2]
Researchers still try to identify what determined the chance of survival [2,3]
[2] “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013].
[3] Wiki, “Titanic.” [Online]. Available: [Accessed: 13-Dec-2013].

3 Kaggle.com
Crowdsourcing and competitions for analytics and data mining
Online presence
Example competition: General Electric (GE) offering $200,000

4 Weka
Waikato Environment for Knowledge Analysis
Open-source tool
Collection of machine learning algorithms and analytical tools
Cross-platform – Java based
Primary authors – researchers at the University of Waikato, NZ

5 Basic Premise
Which classes of passengers affected survivability in the Titanic disaster?
Sex
Cabin Class
Point of Departure
Age

6 Source Data
Kaggle (Kaggle.com) Titanic Disaster Competition
https://www.kaggle.com/c/titanic-gettingStarted
Used Test Data set

7 Data Set – Coaxing for Weka
Original Data
Data Modifications

8 Final Data Format
Final CSV
ARFF Format
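The CSV-to-ARFF step can be sketched roughly as follows. This is a minimal illustration, not the conversion the authors actually ran: the function name `csv_to_arff`, the column names, and the two sample rows are all hypothetical. It assumes nominal columns are declared with an explicit value list, every other column is numeric, and empty cells are written as `?`, the ARFF missing-value marker.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal):
    """Convert CSV text to an ARFF string.

    `nominal` maps a column name to its list of allowed values; every other
    column is declared numeric. Empty cells become '?' (ARFF missing value).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    for name in header:
        if name in nominal:
            lines.append(f"@attribute {name} {{{','.join(nominal[name])}}}")
        else:
            lines.append(f"@attribute {name} numeric")

    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(cell if cell != "" else "?" for cell in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-row sample; the column names are illustrative only.
sample = "survived,pclass,sex,age\n1,1,female,29\n0,3,male,\n"
arff = csv_to_arff(
    sample,
    "titanic",
    {"survived": ["0", "1"], "pclass": ["1", "2", "3"], "sex": ["male", "female"]},
)
print(arff)
```

Declaring the class column (here `survived`) as nominal rather than numeric matters in practice, since Weka classifiers such as J48 expect a nominal class attribute.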

9 J48 Classifier
C4.5-based
81% correct classification
42nd in Kaggle if submitted!
ID3 (1979), C4.5 (1993), C4.8 (1996?) → J48, C5.0 (commercial)
J48: “top-down induction of decision trees”
Information gain: the amount of information gained by knowing the value of the attribute
= (entropy of the distribution before the split) – (entropy of the distribution after it)
Claude Shannon, American mathematician and scientist, 1916–2001
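The information-gain definition on this slide can be illustrated with a short sketch. It computes only the plain "entropy before minus entropy after" quantity described above, not the full J48 procedure (C4.5 additionally normalizes this into a gain ratio). The four passenger rows are made up for illustration and are not taken from the Kaggle data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` before the split minus the weighted entropy after
    splitting `rows` on `attribute`."""
    before = entropy([row[target] for row in rows])

    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])

    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Made-up rows: sex perfectly predicts survival, cabin class tells us nothing.
rows = [
    {"sex": "female", "pclass": "1st", "survived": "yes"},
    {"sex": "female", "pclass": "3rd", "survived": "yes"},
    {"sex": "male", "pclass": "1st", "survived": "no"},
    {"sex": "male", "pclass": "3rd", "survived": "no"},
]
print(information_gain(rows, "sex", "survived"))     # 1.0
print(information_gain(rows, "pclass", "survived"))  # 0.0
```

A top-down tree learner picks the attribute with the highest score at each node, which is why the most informative attribute ends up at the root of the tree.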

10 J48 Tree Diagram
Sex: largest impact
Cabin Class
Departure point

11 Simple K Means Clustering
Sex had clear clustering impact
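For readers unfamiliar with the clustering behind these slides, the basic k-means iteration can be sketched as below. This is a generic illustration on made-up numeric points, assuming squared Euclidean distance and caller-supplied starting centroids; it does not reproduce how Weka's SimpleKMeans treats nominal attributes such as sex or cabin class, and the function name `k_means` is hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until nothing changes."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)

        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print([len(c) for c in clusters])
```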

12 Simple K Means Clustering
Cabin Class showed significant clustering
3rd class not so great

13 Simple K Means Clustering
Age Group
Hard to distinguish clusters, if any
Lowest influencer in J48

14 Simple K Means Point of Departure
Southampton seems significant
We didn’t identify whether departure point was associated with Cabin Class; another study is needed.

15 Simple K Means Point of Departure vs. Survived
Instances colored by class (1st, 2nd, 3rd)
Shows strong association between embarkation point and class

16 Conclusions and Summary
Sex clearly had the most significant impact on the survival rate
J48 classifier: ~81% correctly classified instances
Kaggle competition: 43rd place

17 Finally
Weka is powerful; however, it requires significant coaxing of the data into a more amenable format
At first, we had chosen baseball statistics
Became overwhelmed
Baseball statistics were tossed out very late in our project
Kaggle to the rescue
Stumbled upon this dataset
Simple manipulation produced a compatible ARFF format
Demonstrated which classes of passengers had the greatest impact on survivability

18 References
[1] GE, “Flight Quest Challenge,” Kaggle.com. [Online]. Available: https://www.gequest.com/c/flight2-main. [Accessed: 13-Dec-2013].
[2] “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013].
[3] Wiki, “Titanic.” [Online]. Available: [Accessed: 13-Dec-2013].
[4] Kaggle, Data Science Community. [Online]. Available: [Accessed: 13-Dec-2013].
[5] Weka 3: Data Mining Software in Java. [Online]. Available: [Accessed: 13-Dec-2013].
[6] C4.5 Algorithm, Wikipedia, Wikimedia Foundation. [Online]. Available: [Accessed: 13-Dec-2013].

