Friday 1 April 2016

Watson Analytics Competition Dashboards

Here's a link to our submission for the Watson Analytics Competition!

https://www.youtube.com/watch?v=1I51Sprv5SA

Tuesday 9 February 2016

What's Cooking - Kaggle Competition


INTRODUCTION

“What’s Cooking?” is a Kaggle competition with data provided by yummly.com. The objective is to categorize recipes into cuisines based on the ingredients they use. Every cuisine has its own list of ingredients, some unique to that specific cuisine and some general ingredients used across almost all cuisines. The task is to predict the cuisine by looking at the ingredients used in a recipe. The training data gives us a set of recipes, each with a unique ID and the cuisine it belongs to. The submission file is built on the test data and should contain the ID of each recipe and the cuisine we predict for it.

DATA

The data provided for this competition is in the form of JSON files. JSON is favorable here because the data is mainly textual and has a non-standard structure: the number of ingredients associated with each recipe varies from recipe to recipe and from cuisine to cuisine. Hence JSON suits this data better than CSV or XLSX. There are a total of three files.
TRAIN.JSON

This is the training data, with a total of 39774 recipes across 20 cuisines. Below is a screenshot of a single recipe.
As we can see above, the recipe is the single unit of the data. Inside each recipe there are three main fields:
1) ID:- The ID used to uniquely identify each recipe. It does not appear to be sequential but randomly sampled, presumably to avoid selection bias.
2) Cuisine:- The type of cuisine the recipe belongs to. There are a total of 20 cuisine types; this number matters because these are the classes into which our predictions fall.
3) Ingredients:- The ingredients used to make the recipe. Every recipe has a different number and set of ingredients. This is the input for our models: given a set of ingredients, we predict which cuisine the recipe belongs to.
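As a quick, hedged sanity check (the import code is shown in full in a later section), one way to inspect this structure in R:

library(jsonlite)
train <- fromJSON("train.json", flatten = TRUE)   # data frame with id, cuisine, ingredients
str(train[1, ])                                   # one recipe and its list of ingredients
length(unique(train$cuisine))                     # should print 20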
TEST.JSON

This is the test dataset, the one for which we have to determine the cuisine. It has a total of 9944 recipes and only two variables. Below is a screenshot of the test file.
The fields contain:
1) ID:- Same characteristics as the ID column in the training data. These are new recipes not present in the training data, and no cuisine is given; that is what we need to figure out.
2) Ingredients:- Same characteristics as the ingredients column in the training data.
SAMPLE_SUBMISSION.CSV

This is the sample submission file showing the format in which the predictions for the test set are to be submitted. Since Kaggle automatically calculates the percentage of recipes you got right, the submission needs to be in the predetermined format given below.
The submission is saved as a CSV since it is fairly straightforward data output. The “id” column contains the same IDs as test.json, and the “cuisine” column holds the predictions output by the model we build.
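For illustration, a minimal sketch of writing a file in this format, where predicted is a hypothetical placeholder for whatever vector of cuisine names a model produces:

submission <- data.frame(id = test$id, cuisine = predicted)   # predicted is hypothetical here
write.csv(submission, "submission.csv", row.names = FALSE)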



POSSIBLE APPROACHES AND UNDERSTANDING OF DATA

This is clearly a classification problem: we have to classify every recipe into a particular cuisine. We can treat each cuisine as a class and predict which class a recipe falls into. There can be multiple approaches, and to choose one we first need to understand the data. I have created all my models in R, and I will not include any install statements as they are irrelevant to the solution. Let us first import the data using the jsonlite library, which reads the JSON files into R variables:

library(jsonlite)
options(stringsAsFactors = FALSE)
test <- fromJSON("test.json", flatten = TRUE)
train <- fromJSON("train.json", flatten = TRUE)

We set stringsAsFactors = FALSE because we do not want to convert textual data into factors before we are sure that is the right way to proceed. The test.json file is imported into a variable called test and train.json into train. A quick look at these variables shows how the data is stored in R; below is the output for the test variable.

Looking at the training data, there are 20 different cuisines, but the distribution among them is far from even: most recipes belong to “italian”, “mexican” or “southern_US”, and all the other categories have far fewer recipes. Below is a histogram of the cuisine distribution, created with the ggplot2 library (note that traincleaned is the cleaned data frame built in the cleaning section below):

qplot(traincleaned$cuisine, geom="histogram")
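As a hedged numeric cross-check of the same distribution, without waiting for the cleaned data frame:

sort(table(train$cuisine), decreasing = TRUE)   # italian, mexican and southern_US dominate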
From this we could proceed with a naïve model that assumes all recipes are Italian and submit that. Although this is an extremely simple model with no real prediction involved, it gives a very low score of 0.19, so I will not consider it further (a sketch of this baseline follows this paragraph).

We can also see that there are errors in the data that will hinder us. There are ingredients like “all-spices” where the “-” can be a problem when running some models. There are also repetitions with minor name changes, like “fresh ginger” and “fresh ginger root”: although both are technically the same ingredient, a model will treat them as two different entities, and a recipe may end up misclassified because of it. We will have to clean the ingredient list to remove these problems.
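A minimal sketch of the all-Italian baseline mentioned above, assuming test has been imported as shown earlier:

baseline <- data.frame(id = test$id, cuisine = "italian")
write.csv(baseline, "baseline.csv", row.names = FALSE)   # scores roughly 0.19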
There are also other useful signals. Certain ingredients have a high probability of belonging to a particular cuisine: for example, tortilla is heavily associated with Mexican cuisine and garam masala with Indian cuisine. We can surface these ingredients using our model. Below is a diagram listing the ingredients most strongly tied to a particular cluster, which in this case is simply a cuisine. The code for this graph appears in one of the solutions below, since producing it requires a model to be in place.
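As a quick, hedged check of one such association on the raw training data, before any model is built:

# Which cuisines contain an ingredient matching "tortilla"?
has_tortilla <- sapply(train$ingredients, function(x) any(grepl("tortilla", x)))
sort(table(train$cuisine[has_tortilla]), decreasing = TRUE)[1:5]   # mexican should dominate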

CLEANING OF DATA

As mentioned above, there are a lot of inconsistencies in the data, and it needs to be cleaned before we can do any operations on it.

STEP 1

The inconsistencies in the ingredients are present in both the test and the train data, so to remove them we combine the ingredients into one single variable:

combined <- rbind(train[,c("id", "ingredients")], test)

This combines the ID and ingredients from the training data with the same columns from the test data and puts them into combined. This is purely for easier computation; once cleaned, we will split them back into their original train and test sets.
STEP 2

As mentioned previously, the “-” in the data will cause problems in the modeling part (especially for XGboost), so we need to deal with it. We will use the lapply function from base R to iterate through the ingredients and replace every “-” with a blank space. I replace rather than delete it, since deleting would fuse the two words together:

combined$ingredients <- lapply(combined$ingredients, FUN=function(x) gsub("-", " ", x))

Here gsub substitutes “ ” for every “-”, and lapply applies that function across the ingredient lists.
STEP 3

Now we will use the text mining library tm, which is built for mining and formatting textual data frames. To begin, we convert the variable to a corpus, the base data type that tm works on before any analysis. A corpus is just a “collection of text documents and is an abstract concept”. (Source)

testcorpus <- Corpus(VectorSource(combined$ingredients))

STEP 4

Certain ingredients contain stop words, for example the “and” in “sweet and sour sauce”, which carry no significance and should be removed. Using the same tm package we can remove the English stop words (and, or, etc.) with the code below:

cleanedtest <- tm_map(testcorpus, removeWords, stopwords("english"))

STEP 5

There are also places with stray white space that needs to be removed, since otherwise two equivalent terms are treated as different: “pepper” and “pepper  ” would be considered two different ingredients. It is worth mentioning that the space between two different words, as in “black pepper”, will not be removed, since they are separate words and that space does not count as extra whitespace:

cleanedtest1 <- tm_map(cleanedtest, stripWhitespace)
STEP 6

This is the most important step of all. Certain values in the ingredient list are basically the same ingredient introduced as different ones: for example, “thigh” and “thighs” are technically the same word but would be counted as two different ingredients. To prevent this we stem the document. Stemming eliminates these redundancies and keeps only the root word:

testdtm2 <- tm_map(cleanedtest1, stemDocument)

STEP 7

Next we convert this list of words into a document-term matrix, with each ingredient term as a column and each recipe as a row. If a recipe contains a term, the cell for that recipe and column is marked 1, and 0 otherwise. This gives us a standardized structure to work with:

testDTM <- DocumentTermMatrix(testdtm2)

STEP 8

Next we eliminate terms that do not repeat often enough. These ingredients occur in only one or two recipes, are inconsistent, and are just a hindrance to the model. Since we do not want to remove too many ingredients, we set the threshold very high so that only the rarest terms are removed; here 0.99 is the maximum allowed sparsity, meaning terms absent from more than 99% of recipes are dropped:

testDTM1 <- removeSparseTerms(testDTM, 0.99)

We can also inspect the most frequent terms with the code below:

findFreqTerms(testDTM1, 2000)

STEP 9

The document-term matrix cannot be fed to the models directly, as they require a matrix or a data frame, so we convert it with the code below. There is a small wrinkle: the conversion can add repeated terms together, giving counts of 2 or 3 instead of just 0 and 1, so we apply sign() to the matrix to keep only 0s and 1s:

cuisineDF <- as.data.frame(as.matrix(testDTM1))
finalcuisuine <- sign(cuisineDF)

STEP 10

Now that the ingredient set is cleaned, we split it back into training and test sets according to the original row counts, reattach the cuisine column from the original training data, and convert it to a factor so that the models receive the dependent variable as a factor:

traincleaned <- finalcuisuine[1:39774,]
testcleaned <- finalcuisuine[39775:49718,]
traincleaned$cuisine <- train$cuisine
traincleaned$cuisine <- as.factor(traincleaned$cuisine)

MODELING AND PREDICTION

Now that we have our cleaned data, we move on to the modeling part. This problem is a clear case of classification: we have to assign each recipe, represented by the ingredients stored in our matrix, to a particular cuisine. I have used three models for this classification.

CTREE

“Ctree typically runs and behaves like a decision tree. But a ctree essentially has more data and more values associated with it. Like a decision tree, it too has nodes where decisions are made and splits happen based on different values. But ctrees are multivariate, meaning that instead of building the tree on only one variable, they use multiple variables or features” (source: my previous homework). Here we create a ctree from the ingredients and the cuisine in our training dataset:

ctreeingredient <- ctree(traincleaned$cuisine ~ ., data = traincleaned, controls = ctree_control(maxsurrogate = 8, mtry = 100, maxdepth = 8000))
plot(ctreeingredient)

We pass three things: the cuisine, which is our dependent variable; the full traincleaned dataset; and controls that define how the tree is built. I finalized the parameters using two methods. I started with a basic model with no hyperparameters, then tweaked parameters one by one, stacking one on top of another, to see which had the most impact. To finalize the value of each parameter, I used the caret package to train with 4-fold cross-validation and check what the exact figures were. I have not posted that code, as it would distract from explaining the ctree model itself.
The ctree created by the code above places the higher-occurring ingredients in the top nodes and the low-occurring, low-probability ingredients in the lower nodes. Once a parent node is set, the probability of other ingredients occurring given that parent is calculated, and the tree grows in that fashion. Plotting the tree shows that only a few ingredients are really common while most are not.
The top ingredients are the parent nodes, and branches are determined by the probability of the child node occurring given that the parent has already occurred. The maxsurrogate parameter controls how many surrogate splits are kept at each node; since there are on average around eight ingredients per recipe, I limited it to 8. mtry is set to 100 so that variables are randomly sampled at every node, rather than building the model around a handful of recipes, which would hardcode certain ingredient occurrences and induce bias. Since there are many ingredients and we want the tree to have plenty of depth, maxdepth is set to 8000.

Next we predict on the test dataset:

predictedctree <- predict(ctreeingredient, newdata = testcleaned, list(level=.99))
submmsion_ctree <- data.frame(id=test$id, cuisine=predictedctree)
write.csv(submmsion_ctree, "ctree4.csv", row.names=F)

Here we call the predict function with the model, the newdata argument set to the test dataset, and a parameter controlling the confidence level of the prediction, which matters because the lower parts of the tree carried a lot of noise. This submission scored 0.54938, which is not too bad considering the hyperparameters were chosen in a brute-force manner and may be overfitting the data.


GBM (GENERALIZED BOOSTED REGRESSION MODELS)

Although GBM is usually used for regression, it can also be used for classification. I chose GBM because its boosting algorithm truly randomizes the samples used to build the model. Since this problem comes down to finding which combinations of ingredients have the highest probability of belonging to a cuisine, a boosting algorithm with its high sampling rate should help us better predict the class of each recipe. Moreover, GBM builds the model in step-size increments, training and testing repeatedly and improving the model even before it is applied to the test data, which makes it well suited to producing a good prediction.
To build the model we start with the code below:

traincleanedgbm <- traincleaned[1:10000,]

I begin by subsetting the data. GBM is so processor intensive that training on the entire training set was almost impossible on my laptop, so I reduced the number of rows and used that subset as the training data. I am aware that this means a reduction in accuracy, but that is the price to pay for a laptop with limited processing power. Now we can run the model:

GBM_model_cuisine <- gbm(traincleanedgbm$cuisine~., data=traincleanedgbm, distribution="multinomial", n.trees=1000, cv.folds = 3, verbose=TRUE, n.cores=4)

Here we pass the subset training data frame, the dependent cuisine variable, and several parameters chosen with the data in mind. Since this is a problem with more than two classes (20 cuisines), the distribution is set to multinomial. I limited the number of trees to 1000, which seemed more than enough, and to make up for the lack of trees I used 3-fold cross-validation to keep the fitted function close to the objective. Since this is an intensive model that ran for almost two and a half hours, I kept verbose on so I could watch the model being built (I suspected RStudio was hanging), and set the number of cores to 4 so the model would use all the cores on my laptop.
After building the model, I predicted on the test data, again with 1000 trees to make sure I got the right prediction:

GBM_predict_cuisine <- predict(GBM_model_cuisine, testcleaned, n.trees=1000)
GBM_predict_csv <- data.frame(id=test$id, cuisine= GBM_predict_cuisine)
write.csv(GBM_predict_csv, file = 'trainCusines8.csv')

This got me a score of 0.67166. I could have gotten a better score with the entire training dataset; considering I used only about 26% of the data, this is a very good result. One thing to note: GBM's prediction is probability-based, giving the probability of every recipe belonging to each cuisine, so I had to find the highest probability for each recipe using Excel's MAX function and substitute in the cuisine for that maximum probability. Since I could not accomplish this in R at the time, I did it in Excel before submitting to Kaggle.
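For reference, a hedged sketch of doing that max step in R rather than Excel; this assumes predict() is called with type = "response", which for a multinomial gbm returns a recipes x cuisines x trees array:

probs <- predict(GBM_model_cuisine, testcleaned, n.trees = 1000, type = "response")
best <- apply(probs[, , 1], 1, which.max)     # index of the most probable cuisine per recipe
gbm_cuisine <- dimnames(probs)[[2]][best]     # map indexes back to cuisine names
GBM_predict_csv <- data.frame(id = test$id, cuisine = gbm_cuisine)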

XGBOOST

This was the highest-scoring method I used. XGboost, like GBM, is a boosting algorithm that can be used for classification; the main difference is that XGboost is less susceptible to overfitting (Source). We will use the same type of hyperparameters in this model. Since I did not know this model well, I referred to a similar model used here for selecting the hyperparameters, as it describes a sensible way of choosing them, and I used a similar method to fit the model to my data.
Before we start we need to create the data matrix specific to XGboost, called an xgb.DMatrix. This is an enhanced matrix that carries extra metadata and helps XGboost predict better:

XGboost_matrix <- xgb.DMatrix(data.matrix(traincleaned), label=as.numeric(traincleaned$cuisine)-1)

Here the training dataset is the data and the cuisine is the label. The -1 shifts the numeric factor codes from 1-20 down to 0-19, since XGboost expects class labels to start at zero. Note that traincleaned still contains the cuisine column at this point, so strictly it should be dropped from the feature matrix before training to avoid feeding the label in as a feature.
Now we train the model:

XGbosst_model <- xgboost(XGboost_matrix, eta = 0.09, nround = 2000, objective = "multi:softmax", num_class = 20, nthread = 4, max.depth = 4)

Here we pass the data matrix created above plus some parameters. The step size eta is 0.09, since a larger step size would reduce accuracy, and we perform 2000 boosting rounds so the model can iterate toward reducing the error. Since this is a multi-class classification problem we specify the objective multi:softmax, with num_class = 20 because there are 20 cuisines. Since this model also takes a long time, we run it on 4 threads, and tree depth is capped at max.depth = 4 to keep individual trees shallow.

Now we predict the test data with the model created above; the test data also needs to be a matrix, if not a DMatrix:

XGboost_predicted_cuisine <- predict(XGbosst_model, newdata = data.matrix(testcleaned))
XGboost_submission <- data.frame(id=test$id, cuisine=XGboost_predicted_cuisine)
write.csv(XGboost_submission, "sub5.csv")

This gives us the predicted cuisine values, which we combine with the test IDs to make the submission file. This model gave me my highest score of 0.80078, thanks to its resistance to overfitting. It is also less processor intensive, since it runs in threads, which speeds things up. The only problem I faced was finding the correct parameters for the model; I still believe I can get a better score by fine-tuning these hyperparameters.
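One caveat worth noting, as a hedged sketch rather than part of the original submission: with objective multi:softmax, predict() returns numeric class indexes 0-19 rather than cuisine names, so they need to be mapped back through the factor levels used when building the label:

cuisine_levels <- levels(traincleaned$cuisine)                     # the 20 cuisine names
predicted_names <- cuisine_levels[XGboost_predicted_cuisine + 1]   # undo the zero-based shift
XGboost_submission <- data.frame(id = test$id, cuisine = predicted_names)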

SUMMARY

The results of the three models are summarized in the table below:
Model      Score
Ctree      0.54938
GBM        0.67166
XGboost    0.80078
My current standing on the leaderboard is 75, with the top leader of the competition having a score of 0.82271. My XGboost score was the highest, followed by GBM and ctree. I also made several other submissions, using a Naïve Bayes model and the all-Italian model, but their scores were below average.
 
There is still a lot of room for improvement in GBM and XGboost; if I tune the hyperparameters further, I am confident I can reach a score of at least 0.81.

ENTIRE R CODE

######LOADING AND UNDERSTANDING THE DATA###########
library(jsonlite)
options(stringsAsFactors = FALSE)
test <- fromJSON("test.json", flatten = TRUE)
train <- fromJSON("train.json", flatten = TRUE)
qplot(traincleaned$cuisine, geom="histogram")
######CLEANING THE DATA##########
combined <- rbind(train[,c("id", "ingredients")], test)
combined$ingredients <- lapply(combined$ingredients, FUN=function(x) gsub("-", " ", x))
testcorpus <- Corpus(VectorSource(combined$ingredients))
cleanedtest <- tm_map(testcorpus, removeWords, stopwords("english"))
cleanedtest1 <- tm_map(cleanedtest, stripWhitespace)
testdtm2 <- tm_map(cleanedtest1, stemDocument)
testDTM <- DocumentTermMatrix(testdtm2)
testDTM1 <- removeSparseTerms(testDTM, 0.99)
findFreqTerms(testDTM1, 2000)
cuisineDF <- as.data.frame(as.matrix(testDTM1))
finalcuisuine <- sign(cuisineDF)
traincleaned <- finalcuisuine[1:39774,]
testcleaned <- finalcuisuine[39775:49718,]
traincleaned$cuisine <- train$cuisine
traincleaned$cuisine <- as.factor(traincleaned$cuisine)
########CTREE##########
ctreeingredient <- ctree(traincleaned$cuisine ~ ., data = traincleaned, controls = ctree_control(maxsurrogate = 8, mtry = 100, maxdepth = 8000))
plot(ctreeingredient)
predictedctree <- predict(ctreeingredient, newdata = testcleaned, list(level=.99))
submmsion_ctree <- data.frame(id=test$id, cuisine=predictedctree)
write.csv(submmsion_ctree, "ctree4.csv", row.names=F)
#########GBM############
traincleanedgbm <- traincleaned[1:10000,]
GBM_model_cuisine <- gbm(traincleanedgbm$cuisine~., data=traincleanedgbm, distribution="multinomial", n.trees=1000, cv.folds = 3, verbose=TRUE, n.cores=4)
GBM_predict_cuisine <- predict(GBM_model_cuisine, testcleaned, n.trees=1000)
GBM_predict_csv <- data.frame(id=test$id, cuisine= GBM_predict_cuisine)
write.csv(GBM_predict_csv, file = 'trainCusines8.csv')
#########XGBOOST############
XGboost_matrix <- xgb.DMatrix(data.matrix(traincleaned), label=as.numeric(traincleaned$cuisine)-1)
XGbosst_model <- xgboost(XGboost_matrix, eta = 0.09, nround = 2000, objective = "multi:softmax", num_class = 20, nthread = 4, max.depth = 4)
XGboost_predicted_cuisine <- predict(XGbosst_model, newdata = data.matrix(testcleaned))
XGboost_submission <- data.frame(id=test$id, cuisine=XGboost_predicted_cuisine)
write.csv(XGboost_submission, "sub5.csv")


Tuesday 2 February 2016

Kaggle Competition on Titanic

Titanic, or RMS Titanic, “was a British passenger liner that sank in the North Atlantic Ocean in the early morning of 15 April 1912 after colliding with an iceberg during her maiden voyage from Southampton, UK, to New York City, US”(Wikipedia).
The aim of the Kaggle project is to predict, based on data collected from the Titanic's passenger manifest, who had a better chance of survival. Various techniques can be applied to the dataset, such as machine learning and logistic regression, to determine from each person's characteristics whether he or she had a better chance at survival than others on the ship.
The ship was a passenger liner with several classes, each of varying capacity.

As shown above, the Titanic had 1st class, 2nd class, 3rd class and the crew.
Correctly predicting an individual's chance of survival depends on various factors. Let us look at the data first to get a clear picture of which features can be used to predict survival.



  • PassengerID:- This is just a sequential ID introduced to uniquely identify each person; it does not hold any predictive information.
  • Survived:- This is the key field used to build the models. The survived column tells us whether the person survived: 0 stands for dead and 1 for survived. Based on the survival data of the training set, a model is created and then applied to the test set.
  • Pclass:- Pclass tells us which class the person belongs to: 1 for 1st class, 2 for 2nd class and 3 for 3rd class. This column is very helpful for determining whether being in a higher class gave a better chance at survival.
  • Name:- This is the name of the individual. Although not much information can be retrieved from the name itself, the salutation (Mr, Miss, Mrs, Doctor) is an important factor. It can be used to check whether salutation has any impact on survival, i.e. whether doctors were more likely to survive than a plain Mr. Family names can also be grouped together to find relations between different families.
  • Sex:- Tells us the sex of the person. It is very useful and serves as a complementary variable in certain models. The simplest model you can come up with is predicting which gender has a better chance at survival.
  • Age:- This is the age of the person. It is a significant variable from which a lot can be extrapolated; for example, the ages can be grouped into intervals to predict which age group had a better chance at survival.
  • Sibsp:- This is the number of siblings or spouses aboard the Titanic. It too is an important variable, as it can be used to determine whether a single person had a better chance at survival than someone travelling in a large group.
  • Parch:- The same as Sibsp, but indicating the number of parents or children aboard the Titanic. The same models used for Sibsp can be used for Parch.
  • Ticket:- The ticket number of the person. Although this is not an obviously useful attribute, the data holds some intriguing values, but it has to be cleaned first.
  • Fare:- The money paid for the ticket. This can be used to determine the spending capacity of the person, which can also affect survival, as higher spending usually means higher status and perhaps priority for the lifeboats. A class-type classification can help make more sense of this attribute.
  • Cabin:- The cabin the person stayed in. Although there are numbers in the data, they can be omitted and prediction done solely on the cabin letter, as the lower cabins might have had a lower chance of survival than the cabins above.
  • Embarked:- Shows where the person embarked. The three values this column can take are C for Cherbourg, Q for Queenstown and S for Southampton. The embarkation point can also be used as a supplement in a model, as people who embarked at one location may have had a better chance at survival than those who embarked elsewhere.
I will be looking at two solutions, one in R and the other in Python.
R
Model Used:- Ctree
Ctree typically runs and behaves like a decision tree, but a ctree essentially has more data and more values associated with it. Like a decision tree, it has nodes where decisions are made and splits occur based on different values. But ctrees are multivariate, meaning that instead of building the tree on only one variable, they use multiple variables or features.
In this model the user is trying to create a ctree based on some randomly picked features.
Usually a ctree is restricted through surrogates, i.e. the levels created in the decision tree, but since the dataset is small this restriction is not imposed. The user builds the ctree on the training set and then uses it to predict the test dataset; here the prediction is simply a survived or not-survived flag. The initial part of the code deals with eliminating the NA values in the ages and fares, then counts tickets, and finally plugs all of this into the model.

Original Code in black with explanation in Red:-
library(ggplot2)
Library used for plotting graphs
library(party)
Library for the ctree
library(caret)
Library for the Confusion matrix
library(e1071)
library(randomForest)
Library for the random forest

set.seed(1)
train <- read.csv("../input/train.csv", stringsAsFactors=FALSE)
test  <- read.csv("../input/test.csv",  stringsAsFactors=FALSE)
Importing both the test and the training data without converting string data to factors.
train$Cat <- 'train'
test$Cat <- 'test'
test$Survived <- NA
full <- rbind(train,test)
Creates a separation between the test and the train data by introducing one column with test and train as text in them. Then a new data frame is created that combines test and train.
train$Age[grepl(" Master\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Master\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Master\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Master\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Miss\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Miss\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Miss\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Miss\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Mr\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Mr\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Mr\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Mr\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Mrs\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Mrs\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Mrs\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Mrs\\.",full$Name) & !is.na(full$Age)])
train$Age[grepl(" Dr\\.",train$Name) & is.na(train$Age)] <- mean(full$Age[grepl(" Dr\\.",full$Name) & !is.na(full$Age)])
test$Age[grepl(" Dr\\.",test$Name) & is.na(test$Age)] <- mean(full$Age[grepl(" Dr\\.",full$Name) & !is.na(full$Age)])
Here the mean age is computed from the full data frame, which now contains both test and train, and plugged into the NA values in the test and train data separately. This is the correct approach, since test and train together give a better mean than using either alone. Note that the mean is conditional on the salutation: the mean age of passengers titled Mr is calculated and inserted only into NA rows whose name contains Mr, and likewise for the other salutations.
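Since the same pattern repeats for every salutation, here is a hedged, behavior-equivalent refactor of the block above as a loop:

# One pass per salutation: compute the title-conditional mean age from the
# combined data, then fill the missing ages in train and test.
for (title in c(" Master\\.", " Miss\\.", " Mr\\.", " Mrs\\.", " Dr\\.")) {
  m <- mean(full$Age[grepl(title, full$Name) & !is.na(full$Age)])
  train$Age[grepl(title, train$Name) & is.na(train$Age)] <- m
  test$Age[grepl(title, test$Name) & is.na(test$Age)] <- m
}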
agemodel <- glm(Age ~ Pclass + Fare + Pclass:Fare,data=train)
This creates a generalized linear model where Age is the response and Pclass, Fare and their interaction Pclass:Fare are the predictors. The training data is used to fit this model.
train$Age <- ifelse(is.na(train$Age),predict(agemodel,train[is.na(train$Age),]),train$Age)
test$Age <- ifelse(is.na(test$Age),predict(agemodel,test[is.na(test$Age),]),test$Age)
Here is another, alternative way to fill the missing ages: the generalized linear model fitted above is used to predict the age for the rows where it is missing.
train$Fare[is.na(train$Fare)] <- median(train$Fare, na.rm=TRUE)
test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm=TRUE)
Putting the median fare in place of the missing values.
train$Title<-sapply(train$Name,function(x) strsplit(x,'[.,]')[[1]][2])
test$Title<-sapply(test$Name,function(x) strsplit(x,'[.,]')[[1]][2])
Creating a new Title field by stripping out the salutation, using the fact that the salutation sits between the "," and the "." in the name.
full <- rbind(train,test)
full$Title<-gsub(' ','',full$Title)
Removing any blank spaces in the title, especially those left over from the split.
full$Title[full$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major','Master')]<-'Mr'
full$Title[full$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')]<-'Miss'
Generalizing the rarer salutations into the Mr or Miss categories.
full$Cabin <- substr(full$Cabin,1,1)
full$Cabin      <- as.factor(full$Cabin)
full$Title      <- as.factor(full$Title)
Truncating the cabin to its first letter using substr, then converting both Cabin and Title to factor variables.
train <- full[full$Cat == 'train', ]
test <- full[full$Cat == 'test', ]
Separate the two train and test data frame into their original form.
full$col1<-1
agreg <- aggregate(col1~Ticket, data=full, FUN=sum)
Aggregating on col1, which is set to 1 for every row, and summing by ticket to count how many times each ticket number occurs in the entire set.
for (i in 1:length(agreg[,1]) ) {ifelse(agreg[i,2]<3,0,full$Ticket[full$Ticket==agreg[i,1]]<-0)}
Iterating through agreg: tickets whose count is 3 or more are overwritten with 0 (the assignment sits in the else branch of the ifelse), while tickets occurring fewer than three times keep their original value.
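As a hedged aside, the same collapse can be expressed more compactly with table(); this mirrors what the loop above actually does:

counts <- table(full$Ticket)
full$Ticket[full$Ticket %in% names(counts)[counts >= 3]] <- 0   # collapse tickets seen 3+ times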
train$Ticket <- factor(full$Ticket)[1:891]
test$Ticket <- factor(full$Ticket)[892:1309]
Substituting the factor variable back into the train and test sets.
extractFeatures <- function(data) {
  features <- c("Survived",
                "Cabin" ,
                "Title",
                "Pclass",
                "Age",
                "Ticket",
                "Fare",
                "SibSp",
                "Sex")
  fea <- data[,features]
 
  #fea$Embarked[fea$Embarked==""] = "S"
  fea$Sex      <- as.factor(fea$Sex)
 
  #fea$Embarked <- as.factor(fea$Embarked)
  fea$Survived <- as.factor(fea$Survived)
  return(fea)
}
This function extracts the features from the training set. Since we need these features for our model, we simplify the process by creating a function that subsets only these features and creates factor variables for the relevant fields.
extractFeatures2 <- function(data) {
  features <- c("Cabin" ,
                "Title",
                "Ticket",
                "Pclass",
                "Age",
                "Fare",
                "SibSp",
                "Sex"
  )
  fea <- data[,features]
  #fea$Embarked[fea$Embarked==""] = "S"
  fea$Sex      <- as.factor(fea$Sex)
  fea$Cabin      <- as.factor(fea$Cabin)
  fea$Title      <- as.factor(fea$Title)
  #fea$Embarked <- as.factor(fea$Embarked)
  return(fea)
}
This is the same function as the previous one, but for the test data: it omits Survived (which the test data lacks) and converts the relevant fields to factors.
thedata <- extractFeatures(train)
Extracting the features from training data and putting them in the thedata variable.
myctree <-  ctree(Survived ~ Pclass +Sex +SibSp +Cabin+Title+Fare+ Ticket+Pclass:Sex + Age, data = extractFeatures(train))
model <- myctree
This is the main step of the entire code: here the ctree is created, with Survived as the response and Pclass + Sex + SibSp + Cabin + Title + Fare + Ticket + Pclass:Sex + Age as the predictors, fitted on the training data. As explained above, the ctree is built using probabilities at the nodes; this creates a comprehensive tree that captures the various features and their impact on survival.
trainsubtest <- data.frame(PassengerId = train$PassengerId)
Creating a new subset data frame to output the result.
trainsubtest$Survived <- predict(model, extractFeatures(train))
pred <- predict(model,extractFeatures(train))
This is just a test line the author appears to have left in, so it can be ignored.
output <- confusionMatrix(train$Survived,trainsubtest$Survived)
print(output)
submission <- data.frame(PassengerId = test$PassengerId)
submission$Survived <- predict(model, extractFeatures2(test))
Based on the model made above, a prediction is made for the test data. For each new person, the features selected by the ctree and the resulting survival probability determine whether that person is predicted to survive.
write.csv(submission, file = "1_ctree_submission.csv", row.names=FALSE)
Writing the solution back into a csv file for submission.      

This solution gives a score of 0.7994, which moved my ranking from the low 1800s up into the 800s.

Python solution:-
Model Used:- Logistic Model
A logistic model uses probability to determine whether a particular event will happen. It uses the training data to estimate the probability of the event for given feature values, then sets a threshold separating the two outcomes. Applying the model to the test set yields a probability for each case: if it is greater than the threshold, the event is predicted to occur; if it is less, it is not.
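Concretely, the standard logistic form (a textbook formula, not taken from the code below) is

p(Survived = 1 | x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))

with survival predicted when p exceeds the threshold, typically 0.5.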


Code:-
import numpy as np
For Numerical computation
import pandas as pd
For data frame manipulation
from sklearn.calibration import CalibratedClassifierCV
Used below to calibrate the random forest classifier
from sklearn.ensemble import RandomForestClassifier
For the Random forest generation
from sklearn.grid_search import GridSearchCV
Not Used
from sklearn.linear_model import LogisticRegression
For Logistic regression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import BernoulliRBM
Neither of these is used
from sklearn.pipeline import Pipeline
For creation for pipeline
from sklearn.preprocessing import PolynomialFeatures, Imputer
Preprocessing helpers; PolynomialFeatures is used for the cubic feature expansion in the logistic pipeline (Imputer is imported but not used)
from patsy import dmatrices, dmatrix
For creating design matrices from model formulas

#Print you can execute arbitrary python code
df_train = pd.read_csv("../input/train.csv", dtype={"Age": np.float64}, )
df_test = pd.read_csv("../input/test.csv", dtype={"Age": np.float64}, )
Importing the data and create a pandas data frame
# Drop NaNs
df_train.dropna(subset=['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], inplace=True)
Here rows containing NA are dropped outright: the entire row is removed if even one of the listed columns has an NA.
print("\n\nSummary statistics of training data")
print(df_train.describe())
Outputs the summary of the training data
# Age imputation
df_train.loc[df_train['Age'].isnull(), 'Age'] = np.nanmedian(df_train['Age'])
df_test.loc[df_test['Age'].isnull(), 'Age'] = np.nanmedian(df_test['Age'])
The train line is redundant, as rows with missing values were already dropped from df_train, but the test line matters: df_test was not filtered, so its missing ages are imputed with the median here.
# Training/testing array creation
y_train, X_train = dmatrices('Survived ~ Age + Sex + Pclass + SibSp + Parch + Embarked', df_train)
This creates two matrices from the model formula: y_train holds the response (Survived) and X_train holds the predictors. This is what the prediction model needs.
X_test = dmatrix('Age + Sex + Pclass + SibSp + Parch + Embarked', df_test)
Converts df_test in a similar way, except dmatrix builds only the predictor matrix, since the test data has no Survived column to act as a response.
# Creating processing pipelines with preprocessing. Hyperparameters selected using cross validation
steps1 = [('poly_features', PolynomialFeatures(3, interaction_only=True)),
          ('logistic', LogisticRegression(C=5555., max_iter=16, penalty='l2'))]
Here we define the steps for a pipeline. A pipeline is a structure for applying sequential transformations to a dataset: here we first compute polynomial features and then fit a logistic regression. It is an abstract way of defining a model that can be reused on different datasets without repetition, similar to object-oriented programming.
steps2 = [('rforest', RandomForestClassifier(min_samples_split=15, n_estimators=73, criterion='entropy'))]
Similar to the previous one, but with a RandomForestClassifier.
pipeline1 = Pipeline(steps=steps1)
pipeline2 = Pipeline(steps=steps2)
Both the pipelines are now created.
# Logistic model with cubic features
pipeline1.fit(X_train, y_train.ravel())
print('Accuracy (Logistic Regression-Poly Features (cubic)): {:.4f}'.format(pipeline1.score(X_train, y_train.ravel())))
Here we fit the training data using the first pipeline. ravel() converts y_train to a 1-d array so it is consistent with what the estimator expects. We also print the training accuracy of the logistic regression; we use this to decide which method to go with.
# Random forest with calibration
pipeline2.fit(X_train[:600], y_train[:600].ravel())
calibratedpipe2 = CalibratedClassifierCV(pipeline2, cv=3, method='sigmoid')
calibratedpipe2.fit(X_train[600:], y_train[600:].ravel())
print('Accuracy (Random Forest - Calibration): {:.4f}'.format(calibratedpipe2.score(X_train, y_train.ravel())))
The same as above, the only difference being that here we create a random forest classifier, calibrate it on held-out rows, and check its score as well. Since its accuracy is less than that of the logistic regression, we choose logistic regression.
# Create the output dataframe
output = pd.DataFrame(columns=['PassengerId', 'Survived'])
output['PassengerId'] = df_test['PassengerId']

# Predict the survivors and output csv
output['Survived'] = pipeline1.predict(X_test).astype(int)
output.to_csv('output.csv', index=False)
Creating the output dataset by predicting the test data X_test with the logistic regression used in pipeline one.

Comparing the two solutions by predictive capability, the ctree solution has a better score of 0.7994 versus 0.7655 for the logistic solution. The main reason is that the logistic solution dropped all rows with NaN values, which reduced the accuracy of the estimated probabilities. The ctree scored better because it used a more sophisticated method of substituting for the NAs, and the model also took many more features into account than the logistic model.