Titanic Wreck Survival Prediction (Kaggle Score : 0.78947, Rank : 1110/14207)

  • charithayrs1997
  • Oct 7, 2022
  • 2 min read

As part of a Data Mining course, and as a first step towards learning ML and the use of Kaggle, I've built a machine learning model that predicts which passengers survived the Titanic shipwreck. The dataset (provided on Kaggle) is split into two groups: a training set with ground truth (for building the model) and a test set without ground truth (for prediction and submission).


Data Dictionary :
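For reference, the feature definitions from the Kaggle competition's data page:

  • survival : Survival (0 = No, 1 = Yes)
  • pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • sex : Sex
  • age : Age in years
  • sibsp : Number of siblings / spouses aboard
  • parch : Number of parents / children aboard
  • ticket : Ticket number
  • fare : Passenger fare
  • cabin : Cabin number
  • embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)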

From the above features, we have to predict the survival attribute by building a model on the remaining attributes. It is observed that pclass (passenger ticket class), sex (gender), sibsp (number of siblings and spouses), and parch (number of parents and children) correlate directly with survival in the Titanic shipwreck: about 75% of the survivors were women, and passengers travelling with family members or in a higher passenger class were more likely to survive. Therefore, using these four features, a Random Forest Classifier is run by following the Kaggle tutorial.


Execution :


Using the Pandas library's read_csv function, both the training dataset and the testing dataset are loaded as shown below.
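A minimal sketch of this step; the file paths assume the standard Kaggle competition input directory:

    import pandas as pd

    # Training set (with the ground-truth Survived column) and test set (without it).
    train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
    test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

    train_data.head()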



The percentages of males and females among the survivors are calculated as shown below.
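A sketch of that calculation, along the lines of the Kaggle tutorial:

    # Survival rates by sex in the training data.
    women = train_data.loc[train_data.Sex == "female"]["Survived"]
    men = train_data.loc[train_data.Sex == "male"]["Survived"]

    print("% of women who survived:", sum(women) / len(women))
    print("% of men who survived:", sum(men) / len(men))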

The RandomForestClassifier algorithm is used for building the model and predicting, as shown below. Here n_estimators indicates the number of trees in the forest, "gini" is the criterion used to measure the quality of a split, and random_state controls the randomness.
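A sketch of the baseline model; the hyperparameter values are assumed to be the Kaggle tutorial's (n_estimators=100, max_depth=5, random_state=1; criterion="gini" is scikit-learn's default):

    from sklearn.ensemble import RandomForestClassifier

    y = train_data["Survived"]

    # The four features identified above; get_dummies one-hot encodes Sex.
    features = ["Pclass", "Sex", "SibSp", "Parch"]
    X = pd.get_dummies(train_data[features])
    X_test = pd.get_dummies(test_data[features])

    model = RandomForestClassifier(n_estimators=100, criterion="gini",
                                   max_depth=5, random_state=1)
    model.fit(X, y)
    predictions = model.predict(X_test)

    # Kaggle submission file.
    output = pd.DataFrame({"PassengerId": test_data.PassengerId,
                           "Survived": predictions})
    output.to_csv("submission.csv", index=False)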


Contribution :

Following Kaggle's tutorial, the accuracy achieved with the Random Forest Classifier algorithm was 77.511%. In order to improve this, I followed three approaches.

  • I tried changing the hyperparameters of the Random Forest Classifier and, on the fifth trial, arrived at an accuracy of 78.947% with the hyperparameters (n_estimators=400, criterion="gini", max_depth=8, max_features="auto", random_state=100); a sketch appears after this list. Please find the Python notebook here.

  • I tried the prediction with SVM, Logistic Regression and Decision Tree Classifier, and observed accuracies of 77.751%, 77.511% and 76.555% respectively (see the second sketch below). I found that SVM performed well, but not better than the Random Forest Classifier with the best hyperparameters.

  • Observing that the age and embarked features also correlate with survival, I divided the available ages into age intervals and marked them with numerical categories. I replaced the character categorical values of the embarked attribute with numerical values, and then ran the Random Forest Classifier with these two features added; the accuracy was 78.708% (see the third sketch below).
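First, a sketch of the tuned model from the first approach, reusing X, X_test and y from the baseline above (note that max_features="auto" is only accepted by older scikit-learn versions; recent ones spell the same behaviour "sqrt"):

    # Hyperparameters found on the fifth trial.
    model = RandomForestClassifier(n_estimators=400, criterion="gini",
                                   max_depth=8, max_features="auto",
                                   random_state=100)
    model.fit(X, y)
    predictions = model.predict(X_test)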
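Second, a sketch of the model comparison. The post does not list the hyperparameters used for the alternative models, so the mostly-default settings below are assumptions; the quoted accuracies are Kaggle leaderboard scores from submitting each model's predictions:

    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    for name, clf in [("svm", SVC()),
                      ("logreg", LogisticRegression(max_iter=1000)),
                      ("dtree", DecisionTreeClassifier(random_state=100))]:
        clf.fit(X, y)
        preds = clf.predict(X_test)
        # One submission file per model, scored on Kaggle's hidden test set.
        pd.DataFrame({"PassengerId": test_data.PassengerId,
                      "Survived": preds}).to_csv(f"submission_{name}.csv", index=False)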
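Third, a sketch of the six-feature variant. The exact age intervals and missing-value fills are assumptions, since the post does not specify them:

    # Bin ages into numbered intervals and encode Embarked numerically.
    # Bin edges and the fill values (median age, "S" for Embarked) are assumed.
    for df in (train_data, test_data):
        df["Age"] = df["Age"].fillna(df["Age"].median())
        df["AgeBin"] = pd.cut(df["Age"], bins=[0, 12, 18, 35, 60, 80],
                              labels=[0, 1, 2, 3, 4]).astype(int)
        df["Embarked"] = df["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

    features = ["Pclass", "Sex", "SibSp", "Parch", "AgeBin", "Embarked"]
    X6 = pd.get_dummies(train_data[features])
    X6_test = pd.get_dummies(test_data[features])

    model6 = RandomForestClassifier(n_estimators=400, criterion="gini",
                                    max_depth=8, random_state=100)
    model6.fit(X6, y)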

Among the results I obtained, 78.947% was the best accuracy, achieved with the mentioned hyperparameters of the Random Forest Classifier.
