Tuesday 9 February 2016

What's Cooking - Kaggle Competition


INTRODUCTION

“What’s cooking?” is a Kaggle competition with data provided by yummly.com. The objective of this competition is to categorize each recipe into a cuisine based on the ingredients it uses. Every cuisine has its own list of ingredients, some unique to that specific cuisine and some general ingredients that appear in almost all cuisines. The goal is to predict the cuisine by looking at the different ingredients used to make a recipe. The training data gives us a set of recipes, each with a unique ID and the cuisine that the recipe belongs to. The submission file is built on the test data and should contain the ID of each test recipe and the cuisine that we predict it belongs to.

DATA

The data provided for this competition is in the form of JSON files. The JSON format is favorable here because the data is mainly textual and has a non-standard structure: the number of ingredients associated with each recipe varies from recipe to recipe and from cuisine to cuisine. Hence JSON is more suitable than CSV or XLSX. There are a total of three files.
TRAIN.JSON
This is the training data. It has a total of 39774 recipes covering 20 cuisines in all. Below is a screenshot of a single recipe.
As we can see above, the recipe is the single unit of the data. Inside each recipe we have three main fields.
1) ID: This is used to uniquely identify each recipe. It doesn’t seem to be sequential but randomly sampled, to make sure that there is no selection bias.
2) Cuisine: This signifies the type of cuisine that the particular recipe belongs to. There are a total of 20 cuisine types. This number is important to know, as these are the classes or groups into which our predictions can fall.
3) Ingredients: These are the ingredients used to make the particular recipe. Every recipe has a different number of ingredients and different ingredients in it. This will be the input for our model; based on these ingredients we will build our models to determine, given a set of ingredients, which cuisine the recipe actually belongs to.
TEST.JSON
This is the test data set, the dataset for which we have to determine which cuisine each recipe belongs to. It has a total of 9944 recipes and only two variables. Below is the screenshot of the test file.
Below is what the fields contain.
1) ID: Same characteristics as the ID column in the training data. These are new recipes that are not present in the training data. The file also doesn’t say which cuisine each recipe belongs to, which is what we need to figure out.
2) Ingredients: Same characteristics as the ingredients column in the training data.
SAMPLE_SUBMISSION.CSV
This is the sample submission file that shows the format in which the predictions for the test set are to be submitted. Since Kaggle automatically calculates the percentage of recipes you got right, the submission needs to be in the predetermined format given below.
The submission is saved as a CSV, as it is a fairly straightforward data output. The “id” column is the same as the ID column of the test.json file, and the “cuisine” column holds the prediction output by the model that we will be building.
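As a minimal illustration of that format (the id values below are made up; the real ones come from test.json), the submission is just a two-column CSV:

# Illustrative only: real ids come from the test set, cuisines from the model
example_submission <- data.frame(id = c(101, 102), cuisine = c("italian", "mexican"))
write.csv(example_submission, "example_submission.csv", row.names = FALSE)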



POSSIBLE APPROACHES AND UNDERSTANDING OF DATA

This problem is clearly a classification problem: we have to classify every recipe into a particular cuisine. We can consider each cuisine as a class and try to predict which class a recipe falls into. There can be multiple approaches to the problem, and to choose one we first need to understand the data. I have created all my models in R. I will not be including any install statements, as they are irrelevant to this solution set. So let's first import the data using a library called jsonlite, which helps import the data and store it in an R variable.

library(jsonlite)
options(stringsAsFactors = FALSE)
test <- fromJSON("test.json", flatten = TRUE)
train <- fromJSON("train.json", flatten = TRUE)

We set stringsAsFactors = FALSE because we do not want to convert any textual data into factors, as we are not sure that is the right way to proceed. We then import the data into a variable called test for test.json and train for train.json. A quick look into these variables will show you just how the data is stored in R. Below is the output that we receive when we look into the test variable.

Looking at the train data we can see that there are a total of 20 different cuisines present, but the distribution among these 20 cuisines is not even. Most of the recipes belong to “italian”, “mexican” or “southern_US”; all of the other categories have far fewer recipes. Below is the histogram that shows the distribution of cuisines in the data set. It is created using the ggplot2 library with the following code.

library(ggplot2)
qplot(traincleaned$cuisine, geom="histogram")
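For readers following along without the screenshots, the same structure can be inspected directly in R. This is just a quick sketch using the train and test variables imported above:

# jsonlite returns a data frame with an id, a cuisine label (train only)
# and a list-column holding each recipe's ingredient vector
str(train, max.level = 1)
train$id[1]              # unique ID of the first recipe
train$cuisine[1]         # its cuisine label
train$ingredients[[1]]   # character vector of its ingredients
str(test, max.level = 1) # the test set has only id and ingredients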
From this we could proceed with a naïve model where we assume that all the recipes are Italian and submit that. Although this is an extremely simple model with no real prediction involved, it gives us a very low score of 0.19, so I will not be considering this model further. We can also see that there are a lot of inconsistencies in the data that can be a hindrance to us. For example, there are ingredients like “all-spice” where the “-” can be a problem while running some models. There are also repetitions with minor changes in the name, like “fresh ginger” and “fresh ginger root”. Although both of them are technically the same ingredient, the model will consider them as two different entities, and because of this there is a possibility that a recipe will not get classified into the right category. We will have to clean the ingredient list to remove this problem.
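For reference, a minimal sketch of that all-Italian baseline submission (assuming the test variable imported above) is just:

# Predict "italian" for every test recipe and write the submission file
naive_submission <- data.frame(id = test$id, cuisine = "italian")
write.csv(naive_submission, "all_italian.csv", row.names = FALSE)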
There are also certain other patterns that are useful. For example, certain ingredients have a high probability of belonging to a particular cuisine: tortilla is heavily related to Mexican cuisine, and garam masala to Indian cuisine. We can find these ingredients using our model. Below is a diagram that lists the various ingredients that have a higher probability of being in a particular cluster, which in this case is nothing but a cuisine. The code for that graph is in one of the solutions below, since to get the graph we first need to have a model in place.
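As a rough illustration of the idea (this is not the code behind the diagram above), one could sum the 0/1 ingredient columns within each cuisine and look at the largest counts. This sketch assumes the traincleaned data frame built in the cleaning section below:

# Sum each ingredient-term column within each cuisine (illustrative sketch)
ingredient_only <- traincleaned[, names(traincleaned) != "cuisine"]
term_counts <- rowsum(ingredient_only, group = traincleaned$cuisine)
# Top 5 most frequent terms for one cuisine, e.g. "mexican"
sort(unlist(term_counts["mexican", ]), decreasing = TRUE)[1:5]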








CLEANING OF DATA

As I mentioned above, there are a lot of inconsistencies in the data, and it needs to be cleaned before we can do any type of operation on it.

STEP 1
The inconsistencies in the ingredients are present in both the test and the train data, so to remove them we first combine the ingredients into one single variable.

combined <- rbind(train[,c("id", "ingredients")], test)

This combines the ID and ingredients from the training data with the same columns from the test data and puts them into combined. This is just for easier computation; once the data is cleaned we will split it back into its original train and test sets.
STEP 2
As mentioned previously, the “-” in the data will cause problems in the modeling part (especially for XGboost), so we need to deal with it. We will be using the lapply function from the base package of R to iterate through the variable that holds the ingredients and replace every “-” with a blank space. I am replacing it rather than simply deleting it, since deleting it would merge the two words on either side into one.

combined$ingredients <- lapply(combined$ingredients, FUN=function(x) gsub("-", " ", x))

Here gsub substitutes a “ ” for every “-”, and lapply applies the function to each recipe's ingredient list.
STEP 3
Now we will use the text mining library called tm. It is used for text mining on data frames where we need to mine and format textual data. To begin with we need to convert the variable to a corpus, which is the base data type of the tm package before we do any type of analysis. A corpus is simply a “collection of text documents and is an abstract concept”. (Source)

library(tm)
testcorpus <- Corpus(VectorSource(combined$ingredients))

STEP 4
Certain ingredients contain stop words, like the “and” in “sweet and sour sauce”, which have no significance and should be removed. Using the same tm package we can remove the English stop words (and, or, etc.) with the code below.

cleanedtest <- tm_map(testcorpus, removeWords, stopwords("english"))

STEP 5
There are also places with stray white space that needs to be removed, since two terms that should be the same, like “pepper” and “pepper  ”, would otherwise be considered two different ingredients. Hence we strip the white space. It is worth mentioning that the space between two different words, as in “black pepper”, will not be removed, since they are two separate words and the space between them does not count as extra whitespace.

cleanedtest1 <- tm_map(cleanedtest, stripWhitespace)
STEP 6
This is the most important step of all. There are certain values in the ingredients which are basically the same ingredient but appear as different ones. For example, thigh and thighs are technically the same word but will be treated as two different ingredients. To prevent this we need to stem the document. Stemming will help us eliminate such redundancies in the ingredient set and keep only the root word. (The tm stemmer relies on the SnowballC package being installed.)

testdtm2 <- tm_map(cleanedtest1, stemDocument)

STEP 7
Now we convert this entire list of words into a document-term matrix. This matrix is built so that each ingredient term is a column and each recipe is a row; the value for a recipe/term pair is 1 if the recipe contains that term and 0 if not. In this way we get a standardized data frame that we can work with further.

testDTM <- DocumentTermMatrix(testdtm2)

STEP 8
Next we eliminate terms that do not repeat often enough. These ingredients occur in only one or two recipes, are pretty inconsistent, and are just a hindrance to our model. Since we do not want to remove too many ingredients, we set the threshold very high so that only the very rarest terms are removed. Here 0.99 means that any term missing from more than 99% of the recipes (i.e. appearing in less than roughly 1% of them) is dropped.

testDTM1 <- removeSparseTerms(testDTM, 0.99)

We can also see the most frequent terms with the code below.

findFreqTerms(testDTM1, 2000)

STEP 9
Now that we have created the document-term matrix, we cannot use it directly in the models, as they require either a matrix or a data frame, so we convert the document-term matrix into a data frame with the code below. There is a small problem: after stemming, different terms can collapse into the same column, so some cells end up with counts of 2 or 3 instead of just 0 and 1. We therefore apply sign() to the matrix so that the data frame only contains 0 or 1.

cuisineDF <- as.data.frame(as.matrix(testDTM1))
finalcuisuine <- sign(cuisineDF)

STEP 10
Now that our ingredient set is cleaned, we split it back into training and test sets, add back the cuisine column from the original training data so that we know which cuisine each row belongs to, and convert that column to a factor so that the models can treat it as the dependent (target) variable. The split is done according to the number of rows in the original train and test sets.

traincleaned <- finalcuisuine[1:39774,]
testcleaned <- finalcuisuine[39775:49718,]
traincleaned$cuisine <- train$cuisine
traincleaned$cuisine <- as.factor(traincleaned$cuisine)

MODELING AND PREDICTION

Now that we have our cleaned data, we move on to the modeling part. This problem is a clear case of classification, where we have to classify a recipe with a certain set of ingredients (which is stored in our matrix) into a particular cuisine. I have used three models to do this classification.

CTREE

“Ctree typically runs and behaves like a decision tree. But a Ctree essentially has more data and more values associated with it. Like a decision tree, it too has nodes where decisions are made and splits happen based on the different values. But ctrees are multivariate, meaning that instead of creating the tree based on only one variable, it uses multiple variables or features.” (Source from my previous homework.) So here we are going to create our ctree based on our ingredients and the cuisine from our train dataset.

library(party)
ctreeingredient <- ctree(traincleaned$cuisine ~ ., data = traincleaned, controls = ctree_control(maxsurrogate = 8, mtry = 100, maxdepth = 8000))
plot(ctreeingredient)

Here we are passing three things: first the cuisine, which is our dependent variable; second the full data set, which is traincleaned; and third certain controls that define how the tree is to be created. I finalized the parameters using two different methods. I started out with a basic model with no hyperparameters, then tweaked the parameters one by one, adding one on top of the other, to see which parameters have the most impact. To finalize the value for each parameter, I used the caret package to train with cross validation over 4 folds and see what the exact figures were. I have not posted that code because it would defeat the purpose of trying to explain the ctree model.
The Ctree model created using the above code will have each of the ingredients in the nodes, with the higher-occurring ingredients in the top nodes and the low-occurring, low-probability ingredients in the lower nodes. Once the parent node is set up, the probability of the other ingredients occurring given that the parent node has occurred is calculated, and the tree is built in that fashion. If we plot the tree we can see that only a few ingredients are really common, whereas the rest are not.
The top ingredients are in the parent nodes, and the branches are determined by the probability of the child node occurring when the parent node has already occurred. The maxsurrogate value controls how many surrogate splits are evaluated at each node; since there are on average eight ingredients in each recipe, I limited this to 8. The mtry value is set to 100 so that the tree randomly samples input variables at every node rather than building the model around just a few ingredients, which would introduce bias by effectively hard-coding certain occurrences of ingredients. Since there are a large number of ingredients we want the tree to have a good amount of depth, so we give a maxdepth of 8000. Next we predict on the test data set based on the model.

predictedctree <- predict(ctreeingredient, newdata = testcleaned, list(level=.99))
submmsion_ctree <- data.frame(id=test$id, cuisine=predictedctree)
write.csv(submmsion_ctree, "ctree4.csv", row.names=F)

Here we are using the predict function and passing the model, the newdata, which is the test dataset, and a parameter dealing with the level of confidence with which we predict, which matters because we did get a lot of noise in the lower parts of the tree. This submission scored 0.54938, which is not too bad considering that the hyperparameters were chosen in a brute-force manner and might be overfitting the data.


GBM (GENERALIZED BOOSTED REGRESSION MODEL)

Although GBM is usually used for regression, it can also be used for classification. The reason why I chose GBM is that it is a boosting algorithm, which helps in truly randomizing the samples used to build the model. Since this problem boils down to finding out which combination of ingredients has the highest probability of belonging to a cuisine, boosting algorithms, with their resampling, will surely help us better predict the class of the recipe. Moreover, GBM builds the model in small step-size increments, which lets us train and repeatedly improve the model even before it is applied to the test data. Hence it is very useful in creating a good prediction.
To build the model we are going to use the code below.

traincleanedgbm <- traincleaned[1:10000,]

Here I start by subsetting the data. The reason for this is that GBM is such a processor-intensive model that it was almost impossible to train it on the entire training data on my laptop. Instead I decided to reduce the number of rows and use that subset as the training dataset. I am aware that this means a reduction in accuracy, but that is the price to pay for a laptop with limited processing power. Now we can run the model using the code below.

library(gbm)
GBM_model_cuisine <- gbm(traincleanedgbm$cuisine~., data=traincleanedgbm, distribution="multinomial", n.trees=1000, cv.folds = 3, verbose=TRUE, n.cores=4)

Here we are passing the newly subset training data frame, the dependent cuisine variable, and other parameters. These parameters were chosen considering the data at hand. Since this is a problem with more than two classes, we choose a multinomial distribution, which simply means that there are more than two possible cuisines. I limited the number of trees to 1000, as that seemed more than enough. To make up for the limited number of trees I used 3-fold cross validation, to make sure that the fitted function is closer to the objective function. Since this is an intensive model that ran for almost two and a half hours, I kept verbose on so I could see the model being built, as I suspected that RStudio was hanging. I also set the number of cores to 4 so that the model would use all the cores in my laptop.
After building the model, I used the test data to predict. I used 1000 trees here too to make sure that I get the right prediction.

GBM_predict_cuisine <- predict(GBM_model_cuisine, testcleaned, n.trees=1000)
GBM_predict_csv <- data.frame(id=test$id, cuisine= GBM_predict_cuisine)
write.csv(GBM_predict_csv, file = 'trainCusines8.csv')

This got me a score of 0.67166. I could have gotten a better score if I had used the entire training dataset; considering that I used only about a quarter of the data, this is a very good result. One thing to note is that the prediction GBM gave me was a probability-based output: for every recipe, the probability of it belonging to each of the different cuisines. So I had to edit the CSV to find the highest probability for each recipe using the Excel MAX function and substitute the cuisine with that maximum probability. Since I couldn't accomplish this in R, I did it in Excel before submitting to Kaggle.
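For what it's worth, the same argmax step can also be done in R rather than Excel. This is a sketch under the assumption that predict.gbm for a multinomial fit returns an array of per-class scores with one column per cuisine:

# type = "response" gives an n x n_classes x 1 array of class probabilities
gbm_probs <- predict(GBM_model_cuisine, testcleaned, n.trees = 1000, type = "response")
# pick, for each test recipe, the cuisine with the highest probability
best_col <- apply(gbm_probs[, , 1], 1, which.max)
GBM_predict_label <- colnames(gbm_probs[, , 1])[best_col]
GBM_predict_csv <- data.frame(id = test$id, cuisine = GBM_predict_label)
write.csv(GBM_predict_csv, file = "trainCusines8_argmax.csv", row.names = FALSE)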

XGBOOST

This was the most rewarding method that I used. XGboost, like GBM, is a boosting algorithm that can be used for classification; the main difference is that XGboost is less susceptible to overfitting. (Source) Here we will be using the same type of hyperparameters in this model. Since I did not have much prior experience with this model, I referred to a similar model used here for selecting the hyperparameters, since there is a sensible way of selecting them described in this location, and I adapted that method to fit my model to my data.
Before we start we need to create a data matrix that is specific to XGboost, called an xgb.DMatrix. This is an advanced matrix that further helps XGboost, as we provide additional metadata along with the data itself.

library(xgboost)
XGboost_matrix <- xgb.DMatrix(data.matrix(traincleaned), label=as.numeric(traincleaned$cuisine)-1)

Here we are creating a matrix that uses the training dataset as the data and the cuisine column as the label. Since as.numeric on the cuisine factor gives values from 1 to 20 and XGboost expects class labels starting from 0, we subtract 1 from the label.
Now we train the model using the code below.

XGbosst_model <- xgboost(XGboost_matrix, eta = 0.09, nround = 2000, objective = "multi:softmax", num_class = 20, nthread = 4, max.depth = 4)

Here we pass the data matrix that we created before along with some parameters. We define a step size (eta) of 0.09, because too large a step size hurts the quality of the model. We perform 2000 boosting rounds in total, with each successive round trying to reduce the error of the previous ones, so that we iterate towards a proper model. Since this is a classification problem we specify the objective multi:softmax, and since there are a total of 20 cuisines there are 20 classes. Since this model also takes a long time, we use 4 threads to run it. We use a max.depth of 4 so that each individual tree stays fairly shallow. Now we predict the test data based on the model we created above; the test data also needs to be passed as a matrix, if not a DMatrix. This gives us the cuisine values, which we can combine with the test IDs to make the submission file as below.

XGboost_predicted_cuisine <- predict(XGbosst_model, newdata = data.matrix(testcleaned))
XGboost_submission <- data.frame(id=test$id, cuisine=XGboost_predicted_cuisine)
write.csv(XGboost_submission, "sub5.csv")

This model gave me the highest score of 0.80078, thanks to its resistance to overfitting the data. It is also less processor intensive, as it is capable of running in multiple threads, which speeds up the process. The only problem I faced was finding the correct parameters for the model; I still believe I can get a better score if I fine-tune these hyperparameters.
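One caveat worth noting: with objective multi:softmax the prediction comes back as numeric class indices 0–19 rather than cuisine names, so the indices have to be mapped back to the factor levels before writing the submission. A minimal sketch, assuming the traincleaned$cuisine factor from the cleaning step:

# multi:softmax returns class indices 0..19; map them back to cuisine names
cuisine_levels <- levels(traincleaned$cuisine)
XGboost_predicted_label <- cuisine_levels[XGboost_predicted_cuisine + 1]
XGboost_submission <- data.frame(id = test$id, cuisine = XGboost_predicted_label)
write.csv(XGboost_submission, "sub5_named.csv", row.names = FALSE)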

SUMMARY

The results of the three models are summarized in the table below.
Model Method    Score
Ctree           0.54938
GBM             0.67166
XGboost         0.80078
My current standing on the leaderboard is 75, with the top leader of the competition having a score of 0.82271. My XGboost score was the highest, followed by GBM and ctree. I did make a number of other submissions using a Naïve Bayes model and the all-Italian model, but those scores were below average.
 
There is still a lot of room for improvement in GBM and XGboost; if I further tune my hyperparameters, I am sure that I can reach a score of at least 0.81.
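As an illustration of what that tuning could look like (a sketch, not something that produced the scores above), xgb.cv can be used to compare a small grid of eta and max.depth values by cross-validated multiclass error:

# Hypothetical grid search over eta and max_depth using 4-fold cross validation
library(xgboost)
grid <- expand.grid(eta = c(0.05, 0.09, 0.2), max_depth = c(4, 6, 8))
for (i in seq_len(nrow(grid))) {
  cv <- xgb.cv(data = XGboost_matrix,
               params = list(objective = "multi:softmax", num_class = 20,
                             eta = grid$eta[i], max_depth = grid$max_depth[i]),
               nrounds = 200, nfold = 4, nthread = 4, verbose = FALSE)
  # the cv log holds the mean test merror per round (column name can vary by
  # xgboost version, e.g. test.merror.mean in older releases); keep the best one
  best_err <- min(cv$evaluation_log$test_merror_mean)
  cat(sprintf("eta=%.2f depth=%d -> cv merror %.4f\n",
              grid$eta[i], grid$max_depth[i], best_err))
}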























ENTIRE R CODE

######LOADING AND UNDERSTANDING THE DATA###########
library(jsonlite)
library(ggplot2)
library(tm)
library(party)
library(gbm)
library(xgboost)
options(stringsAsFactors = FALSE)
test <- fromJSON("test.json", flatten = TRUE)
train <- fromJSON("train.json", flatten = TRUE)
qplot(traincleaned$cuisine, geom="histogram")
######CLEANING THE DATA##########
combined <- rbind(train[,c("id", "ingredients")], test)
combined$ingredients <- lapply(combined$ingredients, FUN=function(x) gsub("-", " ", x))
testcorpus <- Corpus(VectorSource(combined$ingredients))
cleanedtest <- tm_map(testcorpus, removeWords, stopwords("english"))
cleanedtest1 <- tm_map(cleanedtest, stripWhitespace)
testdtm2 <- tm_map(cleanedtest1, stemDocument)
testDTM <- DocumentTermMatrix(testdtm2)
testDTM1 <- removeSparseTerms(testDTM, 0.99)
findFreqTerms(testDTM1, 2000)
cuisineDF <- as.data.frame(as.matrix(testDTM1))
finalcuisuine <- sign(cuisineDF)
traincleaned <- finalcuisuine[1:39774,]
testcleaned <- finalcuisuine[39775:49718,]
traincleaned$cuisine <- train$cuisine
traincleaned$cuisine <- as.factor(traincleaned$cuisine)
########CTREE##########
ctreeingredient <- ctree(traincleaned$cuisine ~ ., data = traincleaned, controls = ctree_control(maxsurrogate = 8, mtry = 100, maxdepth = 8000))
plot(ctreeingredient)
predictedctree <- predict(ctreeingredient, newdata = testcleaned, list(level=.99))
submmsion_ctree <- data.frame(id=test$id, cuisine=predictedctree)
write.csv(submmsion_ctree, "ctree4.csv", row.names=F)
#########GBM############
traincleanedgbm <- traincleaned[1:10000,]
GBM_model_cuisine <- gbm(traincleanedgbm$cuisine~., data=traincleanedgbm, distribution="multinomial", n.trees=1000, cv.folds = 3, verbose=TRUE, n.cores=4)
GBM_predict_cuisine <- predict(GBM_model_cuisine, testcleaned, n.trees=1000)
GBM_predict_csv <- data.frame(id=test$id, cuisine= GBM_predict_cuisine)
write.csv(GBM_predict_csv, file = 'trainCusines8.csv')
#########XGBOOST############
XGboost_matrix <- xgb.DMatrix(data.matrix(traincleaned), label=as.numeric(traincleaned$cuisine)-1)
XGbosst_model <- xgboost(XGboost_matrix, eta = 0.09, nround = 2000, objective = "multi:softmax", num_class = 20, nthread = 4, max.depth = 4)
XGboost_predicted_cuisine <- predict(XGbosst_model, newdata = data.matrix(testcleaned))
XGboost_submission <- data.frame(id=test$id, cuisine=XGboost_predicted_cuisine)
write.csv(XGboost_submission, "sub5.csv")


1 comment:

  1. Hi, I have run your code line by line for XGBOOST. But it doesn't work.
    Xgboost_matrix has all zeros
    xgboost_model builds but on xgboost_matrix having all zeros.
    the prediction has all zeros
    the final csv file generated has all zeros in cuisines.

    Could you kindly review the code, Please. My project is based on your methodology. I will cite you in the references.
