Analytic model to predict survival in Titanic Disaster. By Varun Kadekar
Contents: Problem description; Data exploration; Dependent variable dependency; Solution approach; Final logit equation; Validation
Problem Description The goal is to predict each passenger's probability of survival from the data available. The dataset train.csv holds the training data and test.csv holds the validation data. The model built from the analysis of train.csv is then applied to the validation data to check the correctness of the solution.
Data Exploration/Preparation Check the data for outliers. (The data looked good except for a few missing values in the Age column.) Treat the missing values in the Age column by substituting the mean Age of passengers of the same sex. If the Sex of the row with the missing Age is ‘female’, substitute the mean Age of the female passengers on the ship, which is approximately 28. Likewise, for male passengers, it is approximately 31.
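The sex-specific mean imputation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the author's code: the function name `impute_age_by_sex` and the sample rows are assumptions made for the example, while `Sex` and `Age` are the column names mentioned on the slide.

```python
from statistics import mean


def impute_age_by_sex(rows):
    """Return copies of rows with a missing Age replaced by the mean Age of the same Sex."""
    # Collect the known ages for each sex.
    ages_by_sex = {}
    for row in rows:
        if row["Age"] is not None:
            ages_by_sex.setdefault(row["Sex"], []).append(row["Age"])
    means = {sex: mean(ages) for sex, ages in ages_by_sex.items()}

    # Fill each gap with the mean for that passenger's sex; the input is left unchanged.
    return [
        {**row, "Age": means[row["Sex"]] if row["Age"] is None else row["Age"]}
        for row in rows
    ]


# Made-up sample rows for illustration (not taken from train.csv).
sample = [
    {"Sex": "female", "Age": 20.0},
    {"Sex": "female", "Age": 36.0},
    {"Sex": "female", "Age": None},
    {"Sex": "male", "Age": 30.0},
    {"Sex": "male", "Age": 32.0},
    {"Sex": "male", "Age": None},
]
filled = impute_age_by_sex(sample)
```

On the full dataset the same idea is commonly written in pandas with `groupby("Sex")["Age"].transform("mean")` passed to `fillna`. The sketch assumes every sex present in the data has at least one known age.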
Dependent Variable Dependency The correlation between Pclass and survival shows that more people from the higher classes survived and more people from the lower class died. The counts are tabulated below and charted on the next slide. We can observe that 372 of the 491 passengers in Pclass 3 did not survive the accident, and that more than 50% of the people from the higher class survived.
[Table: Survived by Pclass (1, 2, 3) with sum totals; the cell values are not preserved in the transcript.]
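A cross-tabulation of this kind can be sketched with the standard library alone. The function name `survival_by_class` and the sample rows below are assumptions made for the example; only the column names `Survived` and `Pclass` come from the slides.

```python
from collections import Counter


def survival_by_class(rows):
    """Count passengers per (Survived, Pclass) pair and add row and column totals."""
    counts = Counter((row["Survived"], row["Pclass"]) for row in rows)
    classes = sorted({pclass for _, pclass in counts})

    table = {}
    for survived in (0, 1):
        cells = {pclass: counts[(survived, pclass)] for pclass in classes}
        cells["Total"] = sum(cells.values())  # row total
        table[survived] = cells

    # Column totals, including the grand total.
    table["Total"] = {
        key: table[0][key] + table[1][key] for key in classes + ["Total"]
    }
    return table


# Made-up sample rows for illustration (not taken from train.csv).
sample = [
    {"Survived": 1, "Pclass": 1},
    {"Survived": 1, "Pclass": 1},
    {"Survived": 0, "Pclass": 1},
    {"Survived": 1, "Pclass": 3},
    {"Survived": 0, "Pclass": 3},
    {"Survived": 0, "Pclass": 3},
]
table = survival_by_class(sample)

# Share of Pclass 3 passengers in the sample who did not survive.
share_died_3 = table[0][3] / table["Total"][3]
```

Applied to the figures quoted on the slide, 372 of 491 corresponds to roughly 76% of Pclass 3 passengers not surviving.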
Dependent Variable Contd… The chart below shows the number of survivors against the number of dead for each Pclass.
Dependent Variable Contd… The correlation between Age and survival shows that a larger share of the people below the age of 10 survived, and that the percentage of survivors falls as age increases.
Dependent Variable Contd… The correlation between Sex and survival shows that more men than women died.
Solution Approach The correlation analysis showed that only the following variables have a significant impact: Pclass, SibSp, Age and Sex. Age is a continuous variable, so I converted it to a categorical variable. Here is the approach I took: Age Bucket is ‘0’ if Age is between 0 and 10. Age Bucket is ‘1’ if Age is between 10 and 30. Age Bucket is ‘2’ if Age is greater than 30. Sex is converted from a character variable to a numeric one: Sex_Num is ‘0’ if Sex is ‘female’, else Sex_Num is ‘1’.
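The two derived variables can be sketched as small functions. The slide does not say which bucket an age of exactly 10 belongs to, so the sketch assumes it falls in bucket 1; the function names `age_bucket` and `sex_num` are also assumptions made for the example.

```python
def age_bucket(age):
    """Map a numeric age to bucket 0, 1 or 2.

    Assumption: an age of exactly 10 goes to bucket 1 (the slide leaves this open).
    An age of exactly 30 goes to bucket 1, since bucket 2 is 'greater than 30'.
    """
    if age < 10:
        return 0
    if age <= 30:
        return 1
    return 2


def sex_num(sex):
    """Encode Sex as a number: 0 for 'female', 1 for anything else."""
    return 0 if sex == "female" else 1


# Example: derive both features for a 4-year-old girl and a 45-year-old man.
child = (age_bucket(4), sex_num("female"))
adult = (age_bucket(45), sex_num("male"))
```

If the intended boundary is the other way round, changing `age < 10` to `age <= 10` is the only edit needed.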
Final Logit Equation The logit model, run on the dependent variable with the independent variables explained on the previous slide, gives the logit equation below for the probability of survival.
[Equation: not preserved in the transcript.]
PClass --> Value of PClass in the input file.
age_buck --> This is a derived variable. Age_buck is '0' if Age is between 0 and 10, '1' if Age is between 10 and 30, and '2' if Age is greater than 30.
SibSp --> Value from the input file.
Sex_numeric --> This is a derived variable. Sex_numeric is '1' if Sex in the input is 'male', else Sex_numeric is '0'.
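A logit model turns a linear combination of the inputs into a probability through the logistic function, p = 1 / (1 + exp(-z)). The sketch below shows that general form for the four variables named on the previous slide. The fitted coefficients are not preserved in this transcript, so the numbers in `COEFFS` are placeholders chosen only to make the example run; they are not the author's results, and the function name `survival_probability` is likewise an assumption.

```python
import math

# Placeholder coefficients for illustration only, NOT the fitted values from the presentation.
COEFFS = {
    "intercept": 4.0,
    "Pclass": -1.0,
    "age_buck": -0.5,
    "SibSp": -0.25,
    "Sex_numeric": -2.0,
}


def survival_probability(pclass, age_buck, sibsp, sex_numeric, coeffs=COEFFS):
    """Return the logistic-model probability of survival for one passenger."""
    # Linear predictor (log-odds of survival).
    z = (
        coeffs["intercept"]
        + coeffs["Pclass"] * pclass
        + coeffs["age_buck"] * age_buck
        + coeffs["SibSp"] * sibsp
        + coeffs["Sex_numeric"] * sex_numeric
    )
    # Logistic function maps the log-odds to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))


# Example passengers under the placeholder coefficients.
p_female_first = survival_probability(pclass=1, age_buck=1, sibsp=0, sex_numeric=0)
p_male_third = survival_probability(pclass=3, age_buck=1, sibsp=0, sex_numeric=1)
```

With the real coefficients substituted into `COEFFS`, the same function would reproduce the probabilities used in the validation step.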
Validation I applied the logit equation to the validation dataset, test.csv. The chart below shows the predicted probability of survival. The model appears to predict the probability of survival well: where the model predicted a survival probability of more than 90%, the passengers did in fact survive. As the predicted probability of survival decreases, we can observe that more of the passengers actually died.
Validation Additional validation evidence is attached below. In the Excel sheet below, column P shows the probability of survival predicted by the model, and column O shows the actual Survival variable from the myfirstforest.csv dataset.
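The comparison described on the validation slides, predicted probability against actual outcome, can be sketched by grouping predictions into probability bands and computing the observed survival rate in each band. The function name `observed_rate_by_band`, the choice of five equal-width bands and the sample values are assumptions made for the example, not the author's procedure or data.

```python
def observed_rate_by_band(predicted, actual, n_bands=5):
    """Group predictions into equal-width probability bands.

    Returns {band_index: (count, observed_survival_rate)} for the non-empty bands,
    where band 0 covers the lowest probabilities and band n_bands - 1 the highest.
    """
    totals = {}
    for p, y in zip(predicted, actual):
        # A probability of exactly 1.0 is placed in the top band.
        band = min(int(p * n_bands), n_bands - 1)
        count, survived = totals.get(band, (0, 0))
        totals[band] = (count + 1, survived + y)
    return {
        band: (count, survived / count)
        for band, (count, survived) in sorted(totals.items())
    }


# Made-up predictions and outcomes for illustration (1 = survived, 0 = died).
predicted = [0.95, 0.92, 0.55, 0.45, 0.15, 0.05, 0.10]
actual = [1, 1, 1, 0, 0, 0, 1]
bands = observed_rate_by_band(predicted, actual)
```

If the model is well calibrated, the observed survival rate should rise from the lowest band to the highest, which is the pattern the slides describe.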