Presentation on theme: "Titanic Analytic model to predict survival in Titanic Disaster. By,"— Presentation transcript:
1 Titanic
Analytic model to predict survival in the Titanic disaster.
By Varun Kadekar
2 Contents
Problem description
Data exploration
Dependent variable dependency
Solution approach
Final logit equation
Validation
3 Problem Description
The goal is to predict each passenger's probability of survival from the available data.
The dataset train.csv holds the training data and test.csv holds the validation data.
The analysis built on train.csv must then be applied to the validation data to check the correctness of the solution.
4 Data Exploration/Preparation
Check for outliers in the data. (The data looked good apart from a few missing values in the Age column.)
Treat the missing values in the Age column by substituting a mean age, computed separately for each sex.
If the Sex of a row with missing Age is ‘female’, substitute the mean Age of the female passengers on the ship, which is approximately 28. Likewise, for male passengers, the mean is approximately 31.
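The imputation step above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `impute_age_by_sex` is hypothetical, rows are assumed to be dictionaries with `Sex` and `Age` keys, and a missing age is assumed to be represented as `None`. The slides do not say which tool was actually used.

```python
from statistics import mean


def impute_age_by_sex(rows):
    """Return copies of the rows with missing Age filled by the mean Age of the same Sex.

    Assumption: a missing age is stored as None, and every Sex value that has
    a missing age also has at least one known age (otherwise a KeyError is raised).
    """
    # Collect the known ages for each sex.
    ages_by_sex = {}
    for row in rows:
        if row["Age"] is not None:
            ages_by_sex.setdefault(row["Sex"], []).append(row["Age"])

    # Mean age per sex, computed only from rows where Age is present.
    mean_age = {sex: mean(ages) for sex, ages in ages_by_sex.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row["Age"] is None:
            new_row["Age"] = mean_age[new_row["Sex"]]
        filled.append(new_row)
    return filled
```

Applied to train.csv, this should reproduce the substitution described on the slide, with female passengers receiving roughly 28 and male passengers roughly 31.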
5 Dependent Variable Dependency
The correlation between Pclass and Survived shows that proportionally more passengers from the higher class survived and more passengers from the lower class died.
Stats below and charts on the next slide.
We can observe that 372 of the 491 passengers in Pclass 3 did not survive the accident, while more than 50% of first-class passengers (136 of 216) survived.

Survived   Pclass 1   Pclass 2   Pclass 3   Sum
0                80         97        372   549
1               136         87        119   342
Total           216        184        491   891
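To make the comparison across classes concrete, the counts in the table can be turned into survival rates. The sketch below uses only the numbers from this slide; the names `counts` and `survival_rate` are illustrative and not part of the original analysis.

```python
# Counts taken from the slide's table, indexed as counts[survived][pclass].
counts = {
    0: {1: 80, 2: 97, 3: 372},   # did not survive
    1: {1: 136, 2: 87, 3: 119},  # survived
}


def survival_rate(pclass):
    """Share of passengers in the given class who survived."""
    died = counts[0][pclass]
    survived = counts[1][pclass]
    return survived / (survived + died)


rates = {pclass: survival_rate(pclass) for pclass in (1, 2, 3)}
for pclass, rate in rates.items():
    print(f"Pclass {pclass}: {rate:.1%} survived")
```

The rates come out at roughly 63% for Pclass 1, 47% for Pclass 2 and 24% for Pclass 3, which is the downward trend the slide describes.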
6 Dependent Variable Contd…
The chart below shows the number of survivors against the number of deaths for each Pclass.
7 Dependent Variable Contd…
The correlation between Age and survival shows that more passengers below the age of 10 survived, and that the percentage of survival falls as age increases.
8 Dependent Variable Contd…
The correlation between Sex and survival shows that more men died.
9 Solution Approach
The correlation analysis showed that only the following variables have a significant impact: Pclass, SibSp, Age and Sex.
Age is a continuous variable, so it needs to be converted to a categorical variable. Here is the approach I took:
Age Bucket is ‘0’ if Age is between 0 and 10.
Age Bucket is ‘1’ if Age is between 10 and 30.
Age Bucket is ‘2’ if Age is greater than 30.
Sex was changed from a character variable to a numeric one: Sex_Num is ‘0’ if Sex is ‘female’, else Sex_Num is ‘1’.
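The two derived variables can be sketched as small helper functions. The function names are hypothetical. The handling of the boundary ages 10 and 30 is an assumption that follows the inequalities given on the Final Logit Equation slide (10 falls in bucket 0, 30 in bucket 1), and the case-insensitive comparison of the Sex value is also an assumption.

```python
def age_bucket(age):
    """Map a continuous age to the bucket 0, 1 or 2 described on the slide."""
    if age <= 0:
        # Assumption: non-positive ages are treated as invalid input.
        raise ValueError("age must be positive")
    if age <= 10:
        return 0
    if age <= 30:
        return 1
    return 2


def sex_num(sex):
    """Return 0 for 'female' and 1 for anything else, as on the slide."""
    # Assumption: ignore case and surrounding whitespace in the input value.
    return 0 if sex.strip().lower() == "female" else 1
```

For example, `age_bucket(28)` falls into bucket 1, and `sex_num("male")` is 1.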
10 Final Logit Equation
The logit model, run on the dependent variable with the independent variables explained in the previous slide, gives the logit equation below for the probability of survival.
PClass --> Value of PClass in the input file.
age_buck --> age_buck is '0' if 0 < age <= 10, '1' if 10 < age <= 30, and '2' if 30 < age < 100.
SibSp --> Value from the input.
Sex_numeric --> This is a derived variable. Sex_numeric is '1' if sex in the input is 'Male', else Sex_numeric is '0'.
Prob of Survival = exp(M) / (1 + exp(M))
where M = (PClass)*( ) + (age_buck)*( ) + (SibSp)*( ) + (Sex_numeric)*( )
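The equation can be evaluated as in the sketch below. The fitted coefficients are not reproduced in this transcript, so they are passed in through a `coef` dictionary rather than hard-coded. That dictionary, its keys and the function name are assumptions made for illustration. The code uses the algebraically equivalent form 1 / (1 + exp(-M)) of exp(M) / (1 + exp(M)).

```python
import math


def survival_probability(pclass, age_buck, sibsp, sex_numeric, coef):
    """Evaluate Prob of Survival = exp(M) / (1 + exp(M)) for one passenger.

    coef is a hypothetical dict holding one weight per variable; the actual
    fitted values are not given in the transcript.
    """
    m = (
        pclass * coef["PClass"]
        + age_buck * coef["age_buck"]
        + sibsp * coef["SibSp"]
        + sex_numeric * coef["Sex_numeric"]
    )
    # 1 / (1 + exp(-m)) is the same quantity as exp(m) / (1 + exp(m)).
    return 1.0 / (1.0 + math.exp(-m))
```

As a sanity check, if every coefficient is zero then M = 0 and the function returns 0.5 for any passenger. A positive M pushes the probability above 0.5 and a negative M pushes it below.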
11 Validation
Applied the logit equation to the validation dataset, test.csv.
The chart below shows the predicted probability of survival. The model appears to have predicted the probability of survival correctly.
We can observe that where the model predicted a survival probability above 90%, the passengers did in fact survive.
As the predicted probability of survival decreases, we can observe that more of the passengers actually died.
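The check described above, comparing predicted probabilities with actual outcomes, can be approximated by grouping passengers into probability bands and computing the observed share of survivors in each band. This is a minimal sketch and not the method used to produce the chart. The function name, the choice of ten equal-width bands and the 0/1 encoding of the outcome are all assumptions.

```python
def observed_rate_by_band(probabilities, survived, n_bands=10):
    """Observed survival share per predicted-probability band.

    probabilities: predicted values in [0, 1]
    survived: matching actual outcomes, 1 for survived and 0 for died
    Returns a list of n_bands entries; a band with no passengers is None.
    """
    totals = [0] * n_bands
    hits = [0] * n_bands
    for p, s in zip(probabilities, survived):
        # A probability of exactly 1.0 is placed in the top band.
        band = min(int(p * n_bands), n_bands - 1)
        totals[band] += 1
        hits[band] += s
    return [hits[i] / totals[i] if totals[i] else None for i in range(n_bands)]
```

If the model behaves as the slide describes, the observed share in the top band (predictions above 90%) should be close to 1, and the shares should fall as the bands move toward lower predicted probabilities.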
12 Validation
Additional validation proof is attached below. In the Excel sheet below, column P shows the probability of survival predicted by the model.
Column O shows the actual Survival variable from the myfirstforest.csv dataset.