Tuesday 2 February 2016

Kaggle Competition on Titanic

Titanic, or RMS Titanic, “was a British passenger liner that sank in the North Atlantic Ocean in the early morning of 15 April 1912 after colliding with an iceberg during her maiden voyage from Southampton, UK, to New York City, US” (Wikipedia).
The aim of the Kaggle project here, based on data collected from the Titanic's passenger manifest, is to predict who had a better chance of survival. Various techniques, such as machine learning models and logistic regression, can be applied to the dataset to determine, based on the characteristics of each person, whether he or she had a better chance at survival than the others on the ship.
The ship was a passenger liner with several classes of accommodation, each with a different capacity.

The Titanic carried 1st class, 2nd class and 3rd class passengers as well as the crew.
Correctly predicting an individual's chance of survival depends on various factors. Let us look at the data first to get a clear picture of which features can be used to predict survival.



  • PassengerId:- This is just a sequential ID introduced to uniquely identify each person and does not hold any predictive information.
  • Survived:- This is the target field that the models predict. The Survived column tells us whether the person survived or not: 0 stands for dead and 1 stands for survived. Based on the survival data of the training set, a model is created and then applied to the test set.
  • Pclass:- Pclass tells us which class the person travelled in: 1 for 1st class, 2 for 2nd class and 3 for 3rd class. For a class-based model, this column is very helpful for determining whether being in a higher class gave a better chance at survival.
  • Name:- This is the name of the individual. Although not much information can be retrieved from the name itself, the salutation (Mr, Miss, Mrs, Doctor) is an important factor here. It can be used to check whether salutation has any impact on survival, e.g. whether doctors were more likely to survive than a plain Mr. Family names can also be grouped together to find relations between different families.
  • Sex:- Tells us the sex of the person. It is very useful in the data set and is used as a complementary variable in certain models. The simplest model you can come up with is which gender had a better chance at survival.
  • Age:- This is the age of the person. It is a significant variable, as a lot of knowledge can be extracted from it. For example, ages can be grouped into intervals and used to predict which age group had a better chance at survival.
  • SibSp:- This is the number of siblings or spouses aboard the Titanic. This too is an important variable, as it can be used to determine whether a person travelling alone had a better chance at survival than someone in a large group.
  • Parch:- The same idea as SibSp, but this indicates the number of parents or children aboard the Titanic. The same models that are used for SibSp can be used for Parch.
  • Ticket:- The ticket number of the person. Although this is not an obviously useful attribute, the values hold some interesting structure (for example, shared ticket numbers), but they have to be cleaned first.
  • Fare:- The money paid for the ticket. This can be used to gauge the spending capacity of the person, which may also affect survival, as more spending usually means higher status and perhaps priority in reaching a lifeboat. A class-type grouping can help make more sense of this attribute.
  • Cabin:- The cabin the person stayed in. Although the values contain numbers, the numbers can be dropped and prediction can be done on the cabin letter (the deck) alone, as the lower decks may have had a lower chance of survival than the decks above.
  • Embarked:- Shows where the person embarked on the Titanic. The three values this column can take are C for Cherbourg, Q for Queenstown and S for Southampton. The embarkation point can also be used as a supplementary variable in the model, as people who embarked at one location may have had a better chance at survival than people who embarked at another. A quick way to load the data and look over these columns is sketched just below this list.
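
As a minimal sketch (assuming the usual Kaggle file layout used later in this post, ../input/train.csv), the columns above can be loaded and inspected with pandas; the groupby at the end is the simple "which gender survived more" idea mentioned under Sex:

import pandas as pd

# Load the Kaggle training file (path as used later in this post)
train = pd.read_csv("../input/train.csv")

# Column types and missing-value counts for the features described above
print(train.dtypes)
print(train.isnull().sum())

# The simplest possible model: survival rate by sex
print(train.groupby("Sex")["Survived"].mean())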
I will be looking at two solutions, one in R and the other in Python.
R
Model Used:- Ctree
A ctree (conditional inference tree, from the R party package) runs and behaves much like an ordinary decision tree: it has nodes at which a decision is made and the data is split according to the values of a feature. The difference is that a ctree chooses its splits using statistical significance tests, and it can draw on several variables or features rather than building the tree on only one.
In this model the author builds a ctree from a selection of the features in the data.
Usually a ctree is restricted through its control parameters, such as the maximum depth of the tree or the number of surrogate splits, but since the dataset is small no such restriction is imposed here. The author builds the ctree on the training set and then uses it to predict on the test set, where the prediction is simply a survived/not-survived flag. The initial part of the code deals with filling in the NA values in Age and Fare, then counts how often each ticket number occurs, and finally plugs all of these features into the model.
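
ctree itself lives in R's party package, but as a rough Python stand-in, a plain scikit-learn decision tree shows the same node-and-split idea (it picks splits by impurity rather than by the significance tests a ctree uses). This is only an illustrative sketch using a few complete columns from the Kaggle training file, not the author's model:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("../input/train.csv")

# A few columns with no missing values, plus Sex encoded as 0/1
X = train[["Pclass", "SibSp", "Parch"]].copy()
X["Sex"] = (train["Sex"] == "female").astype(int)
y = train["Survived"]

# A shallow tree, roughly analogous to restricting the depth of a ctree
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
print(tree.score(X, y))  # in-sample accuracy of the toy tree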

Original Code in black with explanation in Red:-
library(ggplot2)
Library used for plotting graphs
library(party)
Library for the ctree
library(caret)
Library for the Confusion matrix
library(e1071)
library(randomForest)
Library for the random forest

set.seed(1)
train <- read.csv("../input/train.csv", stringsAsFactors=FALSE)
test  <- read.csv("../input/test.csv",  stringsAsFactors=FALSE)
Importing both the test and the training data without converting string columns to factors
train$Cat <- 'train'
test$Cat <- 'test'
test$Survived <- NA
full <- rbind(train,test)
Tags each row with a Cat column containing 'train' or 'test' so the two sets can be separated again later, adds an empty Survived column to the test set, and then creates a new data frame, full, that combines test and train.
train$Age[grepl(" Master\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Master\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Master\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Master\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Miss\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Miss\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Miss\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Miss\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Mr\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Mr\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Mr\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Mr\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Mrs\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Mrs\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Mrs\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Mrs\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Dr\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Dr\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Dr\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Dr\\.",full$Name) & !is.na(full$Age)])
Here the full data frame, which now contains both test and train, is used to calculate the mean age. That mean is then plugged into the NA values in the test and train data separately. This is the right approach, as test and train together give a better estimate of the mean than either set alone. Note that the mean is computed per salutation: the mean age of passengers titled Mr is inserted only into the missing ages of passengers titled Mr, and likewise for the other salutations.
agemodel <- glm(Age ~ Pclass + Fare + Pclass:Fare,data=train)
This creates a generalized linear model with Age as the response and Pclass, Fare and their interaction (Pclass:Fare) as the predictors. The training data is used to fit this model.
train$Age <- ifelse(is.na(train$Age),predict(agemodel,train[is.na(train$Age),]),train$Age)
test$Age <- ifelse(is.na(test$Age),predict(agemodel,test[is.na(test$Age),]),test$Age)
This is an alternative way to fill in the remaining NA ages: the generalized linear model built above is used to predict an age for each row whose Age is still missing, in both the train and test data.
train$Fare[is.na(train$Fare)] <- median(train$Fare, na.rm=TRUE)
test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm=TRUE)
Replacing the missing fares with the median fare from the training data.
train$Title<-sapply(train$Name,function(x) strsplit(x,'[.,]')[[1]][2])
test$Title<-sapply(test$Name,function(x) strsplit(x,'[.,]')[[1]][2])
Creating a new field, Title, by splitting the name on ',' and '.': the salutation sits between the comma and the period, which is the piece we need.
full <- rbind(train,test)
full$Title<-gsub(' ','',full$Title)
The train and test data are recombined into full, and any blank spaces in the title are removed, especially the leading space left over from the split.
full$Title[full$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major','Master')]<-'Mr'
full$Title[full$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')]<-'Miss'
Generalize the type of salutation into a Mr or Miss category.
full$Cabin <- substr(full$Cabin,1,1)
full$Cabin      <- as.factor(full$Cabin)
full$Title      <- as.factor(full$Title)
Take only the first letter of the cabin value (the deck) and convert Cabin and Title into factor variables.
train <- full[full$Cat == 'train', ]
test <- full[full$Cat == 'test', ]
Split full back into the train and test data frames using the Cat column.
full$col1<-1
agreg <- aggregate(col1~Ticket, data=full, FUN=sum)
A helper column col1 is set to 1 and then summed per Ticket with aggregate, which gives the number of times each ticket number occurs in the entire set.
for (i in 1:length(agreg[,1]) ) {ifelse(agreg[i,2]<3,0,full$Ticket[full$Ticket==agreg[i,1]]<-0)}
Iterating through agreg: when a ticket number occurs fewer than three times the ifelse simply returns 0 and nothing changes, but when it occurs three or more times the assignment in the else branch fires and every matching Ticket value in full is overwritten with 0. So, as written, the frequently shared ticket numbers are collapsed into a single 0 level, while tickets that occur only once or twice keep their original values.
train$Ticket <- factor(full$Ticket)[1:891]
test$Ticket <- factor(full$Ticket)[892:1309]
Substituting the new factor variable back into the train and test sets (rows 1-891 of full are the training rows and rows 892-1309 are the test rows).
extractFeatures <- function(data) {
  features <- c("Survived",
                "Cabin" ,
                "Title",
                "Pclass",
                "Age",
                "Ticket",
                "Fare",
                "SibSp",
                "Sex")
  fea <- data[,features]
 
  #fea$Embarked[fea$Embarked==""] = "S"
  fea$Sex      <- as.factor(fea$Sex)
 
  #fea$Embarked <- as.factor(fea$Embarked)
  fea$Survived <- as.factor(fea$Survived)
  return(fea)
}
This function extracts the model features from the training set. Since we need the same set of columns every time the training data is fed to the model, wrapping the selection in a function keeps things simple: it subsets only the listed features and converts the relevant fields to factors.
extractFeatures2 <- function(data) {
  features <- c("Cabin" ,
                "Title",
                "Ticket",
                "Pclass",
                "Age",
                "Fare",
                "SibSp",
                "Sex"
  )
  fea <- data[,features]
  #fea$Embarked[fea$Embarked==""] = "S"
  fea$Sex      <- as.factor(fea$Sex)
  fea$Cabin      <- as.factor(fea$Cabin)
  fea$Title      <- as.factor(fea$Title)
  #fea$Embarked <- as.factor(fea$Embarked)
  return(fea)
}
This is the same as the previous function but for the test data: it selects the same features, minus Survived, and converts the relevant fields to factors.
thedata <- extractFeatures(train)
Extracting the features from the training data and storing them in thedata.
myctree <-  ctree(Survived ~ Pclass +Sex +SibSp +Cabin+Title+Fare+ Ticket+Pclass:Sex + Age, data = extractFeatures(train))
model <- myctree
This is the main step of the entire script: here the ctree is created, with Survived as the response and Pclass, Sex, SibSp, Cabin, Title, Fare, Ticket, the Pclass:Sex interaction and Age as the predictors. The training data is used to grow the tree. As explained above, the splits are chosen with statistical tests and the terminal nodes give the predicted outcome. The result is a comprehensive tree that captures how the various features affect survival.
trainsubtest <- data.frame(PassengerId = train$PassengerId)
Creating a new data frame, keyed by PassengerId, to hold the predictions on the training data.
trainsubtest$Survived <- predict(model, extractFeatures(train))
pred <- predict(model,extractFeatures(train))
pred is just duplicate test code that the author must have left in, so it can be ignored.
output <- confusionMatrix(train$Survived,trainsubtest$Survived)
print(output)
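The confusion matrix compares the actual Survived values in the training set with the model's predictions on that same training set and prints the result, giving a quick in-sample check of how well the ctree fits.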
submission <- data.frame(PassengerId = test$PassengerId)
submission$Survived <- predict(model, extractFeatures2(test))
Based on the model made above, a prediction is made for the test data. Using the features selected by the ctree and the estimated survival probabilities, this outputs whether each passenger in the test set is predicted to survive or not.
write.csv(submission, file = "1_ctree_submission.csv", row.names=FALSE)
Writing the solution back into a csv file for submission.      

This solution gives a score of 0.7994, which took the leaderboard position from the low 1800s up to the 800s.

Python solution:-
Model Used:- Logistic Regression
Logistic regression uses probability to determine whether a particular event will happen or not. It uses the training data to estimate the probability of the event for given values of the features, and a threshold then separates the two outcomes. The fitted model is applied to the test set: if the predicted probability is greater than the threshold the event is predicted to occur, and if it is less than the threshold it is not.
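
A minimal sketch of that idea, using a plain LogisticRegression and a handful of the Kaggle columns rather than the pipelines built below (the column choice here is purely illustrative): the model produces a probability per passenger, and a 0.5 threshold turns it into a 0/1 prediction.

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("../input/train.csv").dropna(subset=["Age"])  # keep the sketch simple

X = train[["Pclass", "Age", "SibSp", "Parch"]].copy()
X["Sex"] = (train["Sex"] == "female").astype(int)   # encode sex as 0/1
y = train["Survived"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]       # probability that Survived == 1
pred = (proba > 0.5).astype(int)         # the threshold turns probabilities into classes
print((pred == y).mean())                # in-sample accuracy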


Code:-
import numpy as np
For Numerical computation
import pandas as pd
For data frame manipulation
from sklearn.calibration import CalibratedClassifierCV
Used later to calibrate the random forest's probabilities
from sklearn.ensemble import RandomForestClassifier
For the Random forest generation
from sklearn.grid_search import GridSearchCV
Not Used
from sklearn.linear_model import LogisticRegression
For Logistic regression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import BernoulliRBM
Not used (neither GaussianNB nor BernoulliRBM appears again in the script)
from sklearn.pipeline import Pipeline
For creating processing pipelines
from sklearn.preprocessing import PolynomialFeatures, Imputer
PolynomialFeatures expands the feature set for the logistic regression; Imputer is imported but not used below
from patsy import dmatrices, dmatrix
For building the design matrices (response and predictors) from an R-style formula

#Print you can execute arbitrary python code
df_train = pd.read_csv("../input/train.csv", dtype={"Age": np.float64}, )
df_test = pd.read_csv("../input/test.csv", dtype={"Age": np.float64}, )
Importing the data and creating pandas data frames
# Drop NaNs
df_train.dropna(subset=['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], inplace=True)
Here the NA rows are simply dropped from the training data: if any of the listed columns has a missing value, the entire row is removed.
print("\n\nSummary statistics of training data")
print(df_train.describe())
Outputs the summary of the training data
# Age imputation
df_train.loc[df_train['Age'].isnull(), 'Age'] = np.nanmedian(df_train['Age'])
df_test.loc[df_test['Age'].isnull(), 'Age'] = np.nanmedian(df_test['Age'])
For the training data this is redundant, since the NaN rows were already dropped above, but it does fill in the missing ages in the test set, which the dropna did not touch.
# Training/testing array creation
y_train, X_train = dmatrices('Survived ~ Age + Sex + Pclass + SibSp + Parch + Embarked', df_train)
This creates two matrices from the formula: y_train holds the response (Survived) and X_train holds the predictors, with categorical columns such as Sex and Embarked expanded into dummy variables. This is the form the scikit-learn estimators expect.
X_test = dmatrix('Age + Sex + Pclass + SibSp + Parch + Embarked', df_test)
df_test is converted in a similar way, but dmatrix builds only the predictor matrix, since the test set has no Survived column to use as a response.
# Creating processing pipelines with preprocessing. Hyperparameters selected using cross validation
steps1 = [('poly_features', PolynomialFeatures(3, interaction_only=True)),
          ('logistic', LogisticRegression(C=5555., max_iter=16, penalty='l2'))]
Here we define the steps for a pipeline. A pipeline is essentially a structure that applies a sequence of transformations to a data set: here the features are first expanded with cubic interaction terms (PolynomialFeatures) and the result is then fed into a logistic regression. It is an abstract way of defining a model that can be reused on different data sets without repeating the preprocessing, much like object-oriented programming.
steps2 = [('rforest', RandomForestClassifier(min_samples_split=15, n_estimators=73, criterion='entropy'))]
Similar to the previous one, but with a RandomForestClassifier as the only step.
pipeline1 = Pipeline(steps=steps1)
pipeline2 = Pipeline(steps=steps2)
Both the pipelines are now created.
# Logistic model with cubic features
pipeline1.fit(X_train, y_train.ravel())
print('Accuracy (Logistic Regression-Poly Features (cubic)): {:.4f}'.format(pipeline1.score(X_train, y_train.ravel())))
Here we fit the first pipeline on the training data, with y_train as the target. ravel() flattens y_train into a 1-D array, which is the shape scikit-learn expects alongside X_train. We also print the training accuracy of the logistic regression; this is used to decide which of the two models to go with.
# Random forest with calibration
pipeline2.fit(X_train[:600], y_train[:600].ravel())
calibratedpipe2 = CalibratedClassifierCV(pipeline2, cv=3, method='sigmoid')
calibratedpipe2.fit(X_train[600:], y_train[600:].ravel())
print('Accuracy (Random Forest - Calibration): {:.4f}'.format(calibratedpipe2.score(X_train, y_train.ravel())))
The same as above, except that here a random forest classifier is fitted on the first 600 rows, a calibrated version (CalibratedClassifierCV with a sigmoid) is fitted on the remaining rows, and its accuracy on the full training set is checked as well. Since this accuracy is lower than that of the logistic regression, the logistic regression is chosen.
# Create the output dataframe
output = pd.DataFrame(columns=['PassengerId', 'Survived'])
output['PassengerId'] = df_test['PassengerId']

# Predict the survivors and output csv
output['Survived'] = pipeline1.predict(X_test).astype(int)
output.to_csv('output.csv', index=False)
Creating the output data frame with the test PassengerIds, predicting Survived for the test data X_test with the logistic-regression pipeline (pipeline1), and writing the result to output.csv.

Comparison:-
Comparing the two solutions by predictive capability, the ctree solution has a better score of 0.7994 versus 0.7655 for the logistic solution. The main reason is that the logistic solution simply dropped all rows with NaN values, which reduced the amount of data available for estimating the probabilities. The ctree solution scored better because it used a much more sophisticated method of substituting for the NAs, and because its model took in more features than the logistic model.


