1. Classification of Titanic Passenger Data and Chances of Surviving the Disaster: Data Mining with Weka and Kaggle Competition Data. Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke. Seidenberg School of Computer Science and Information Systems, Pace University, White Plains, NY, US
2. Background. Titanic disaster: April 15, 1912. 1,502 passengers and crew perished out of 2,224. Researchers still try to identify the chances of survival [2,3]. “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013]. Wiki, “Titanic.” [Online]. Available: [Accessed: 13-Dec-2013].
3. Kaggle.com. Crowdsourcing and competitions for analytics and data mining. Online presence. Example competition: General Electric (GE) offering $200,000.
4. Weka: Waikato Environment for Knowledge Analysis. Open-source tool. Collection of machine learning algorithms and analytical tools. Cross-platform, Java-based. Primary authors: researchers at the University of Waikato, NZ.
5. Basic Premise. Which classes of passengers affected survivability in the Titanic disaster? Sex, cabin class, point of departure, age.
6. Source Data. Kaggle (Kaggle.com) Titanic Disaster Competition, https://www.kaggle.com/c/titanic-gettingStarted (test data set used).
7. Data Set: Coaxing for Weka. Original data; data modifications.
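The slide does not list the actual modifications, so the following is only a minimal sketch of the kind of CSV-to-ARFF conversion Weka needs. It assumes a subset of the column names used in Kaggle's Titanic CSV (Survived, Pclass, Sex, Age, Embarked); the sample rows, the `ATTRIBUTES` table, and the `csv_to_arff` helper are hypothetical and are not the authors' procedure.

```python
import csv
import io

# Hypothetical sample rows in the layout of the Kaggle CSV (subset of columns).
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Embarked
0,3,male,22,S
1,1,female,38,C
1,3,female,,Q
"""

# Nominal attributes list their allowed values; None marks a numeric attribute.
ATTRIBUTES = [
    ("Survived", ["0", "1"]),
    ("Pclass", ["1", "2", "3"]),
    ("Sex", ["male", "female"]),
    ("Age", None),
    ("Embarked", ["C", "Q", "S"]),
]


def csv_to_arff(csv_text, relation="titanic"):
    """Return ARFF text for the columns named in ATTRIBUTES."""
    rows = csv.DictReader(io.StringIO(csv_text))
    lines = ["@relation " + relation, ""]
    for name, values in ATTRIBUTES:
        kind = "numeric" if values is None else "{" + ",".join(values) + "}"
        lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with "?".
        lines.append(",".join(row[name] or "?" for name, _ in ATTRIBUTES))
    return "\n".join(lines) + "\n"


arff_text = csv_to_arff(SAMPLE_CSV)
print(arff_text)
```

Declaring Survived and Pclass as nominal rather than numeric is a design choice in this sketch: Weka classifiers such as J48 require a nominal class attribute.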
9. J48 Classifier. C4.5-based. 81% correct classification; 42nd in Kaggle if submitted! Lineage: ID3 (1979); C4.5 (1993); C4.8 (1996?), i.e. J48; C5.0 (commercial). J48: “top-down induction of decision trees.” Information gain: the amount of information gained by knowing the value of the attribute, i.e. (entropy of distribution before the split) - (entropy of distribution after it). Claude Shannon, American mathematician and scientist, 1916–2001.
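The information-gain definition above can be sketched in a few lines. This is an illustration of the formula on the slide only, not Weka's J48 code (C4.5-style learners typically rank splits by gain ratio, a normalized variant). The toy records and the function names are made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Made-up toy records, chosen only to exercise the functions.
toy = [
    {"sex": "female", "pclass": "1", "survived": "yes"},
    {"sex": "female", "pclass": "3", "survived": "yes"},
    {"sex": "male", "pclass": "1", "survived": "no"},
    {"sex": "male", "pclass": "3", "survived": "no"},
]

gain_sex = information_gain(toy, "sex", "survived")
gain_pclass = information_gain(toy, "pclass", "survived")
print(gain_sex, gain_pclass)
```

In this toy set, splitting on sex separates the outcomes completely while splitting on class separates nothing, so a top-down tree learner would place sex at the root.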
10. J48 Tree Diagram. Sex: largest impact; cabin class; departure point.
11. Simple K-Means Clustering. Sex had a clear clustering impact.
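For readers unfamiliar with the technique, here is a minimal sketch of the basic k-means loop (assign each point to its nearest centroid, then recompute centroids as cluster means). It is a plain numeric version and does not reproduce Weka's SimpleKMeans, which also handles nominal attributes. The encoding (sex as 0 = male, 1 = female; survived as 0/1), the data points, and the deterministic initialization are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=20):
    """Basic k-means on tuples of numbers; returns (centroids, clusters)."""
    # Deterministic start (an assumption of this sketch): first k distinct points.
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its members.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up (is_female, survived) pairs, chosen only to exercise the function.
points = [(0, 0), (0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```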
12. Simple K-Means Clustering. Cabin class showed significant clustering; 3rd class not so great.
13. Simple K-Means Clustering: Age Group. Hard to distinguish clusters, if any. Lowest influencer in J48.
14. Simple K-Means: Point of Departure. Southampton seems significant. We didn't identify whether departure was associated with cabin class; another study is needed.
15. Simple K-Means: Point of Departure vs. Survived. Instances colored by class (1st, 2nd, 3rd). Shows a strong association between embarkation point and class.
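One simple way to inspect an association between embarkation point and cabin class is a contingency table of counts. The sketch below assumes the Kaggle port codes (C = Cherbourg, Q = Queenstown, S = Southampton); the records are made up and the `crosstab` helper is hypothetical, so the printed counts say nothing about the real data.

```python
from collections import Counter

# Made-up (embarked, pclass) pairs, chosen only to exercise the function.
records = [("S", 3), ("S", 3), ("S", 2), ("C", 1), ("C", 1), ("Q", 3)]


def crosstab(pairs):
    """Count co-occurrences and return {row_value: {column_value: count}}."""
    counts = Counter(pairs)
    rows = sorted({a for a, _ in pairs})
    cols = sorted({b for _, b in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


table = crosstab(records)
for port, by_class in table.items():
    print(port, by_class)
```

Rows whose counts concentrate in one column suggest that port and class are associated; a formal check would need a statistical test on the real counts.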
16. Conclusions and Summary. Sex clearly had the most significant impact on the survival rate. J48 classifier: ~81% correctly classified instances. Kaggle competition: 43rd place.
17. Finally. Weka is powerful; however, it requires significant coaxing of the data into a more amenable format. At first, we had chosen baseball statistics and became overwhelmed; the baseball statistics were tossed out very late in our project. Kaggle to the rescue: we stumbled upon this dataset, and simple manipulation produced a compatible ARFF format. Demonstrated which classes of passengers had the greatest impact on survivability.