Kaggle Playground 4.06 Results. With another Kaggle Playground… | by Trevor Glenn | Jul, 2024


With another Kaggle Playground Competition in the books, I am proud to share my results and methods from this month’s competition!

I achieved my best Playground competition ranking yet, finishing in the top 33% of over 2,700 competitors, and along the way I learned quite a bit about working on multi-class classification problems. This month’s competition was a multi-class classification problem: predicting the academic risk of students in higher education from a dataset of common student features. The dataset was provided by Kaggle and generated from a dataset originally collected by the Polytechnic Institute of Portalegre in Portugal and archived by the University of California, Irvine — a link to the original dataset can be found HERE

I started the competition with exploratory data analysis to gain insight into the dataset, identify my most important features, and determine any transformations needed before feeding the data into a model. I then built a single RandomForestClassifier as a baseline model, chosen for its ease and simplicity in classification tasks. From there I built several more machine learning models, including Gradient Boosting classifiers, and finally my best model: a CatBoostClassifier. This CatBoostClassifier achieved an accuracy score of 0.83513, which placed me 879th out of 2,739 competitors (top 33%)! The model was rather simple to work with, as I did not need to specify parameters or engineer features specifically for it. I believe the main reason it performed so well is the dataset’s many categorical features predicting a categorical target, which falls squarely within the model’s specialization. For further improvement, I would perform additional feature selection and tune the model’s hyperparameters with Optuna to reach a higher accuracy score.
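The baseline step above can be sketched as follows — a minimal example with scikit-learn on synthetic stand-in data, since the competition dataset and the author's exact features are not reproduced here:

```python
# Minimal sketch of a RandomForest baseline for a 3-class problem,
# using synthetic data as a stand-in for the student dataset
# (classes roughly corresponding to Dropout / Enrolled / Graduate).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic multi-class data; the real dataset's columns differ.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline: default-parameter RandomForest, as in the write-up.
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)
acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"baseline accuracy: {acc:.3f}")
```

A default RandomForest like this gives a quick reference score that later models (gradient boosting, CatBoost) can be measured against.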

From there I developed ensemble models using a Voting Classifier and a Stacking Classifier, combining many of the models I had built previously along with others I was testing to see if they would improve my score. However, due to the amount of noise, the ensembles were unable to beat my CatBoostClassifier score: they had a tendency to overfit the training data, whereas the CatBoostClassifier was robust enough to fit the training and testing data similarly. To improve my ensembles, I would like to do further testing and tuning, as well as run cross-validated grid searches to find optimal hyperparameters for each model.
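The voting approach can be sketched like this — a hedged example with scikit-learn on placeholder data; the actual list of base models and their settings are assumptions, not the author's exact setup:

```python
# Soft-voting ensemble sketch: average predicted probabilities
# across several heterogeneous base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=800, n_features=15, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average class probabilities rather than hard votes
)
ensemble.fit(X_tr, y_tr)
ens_acc = accuracy_score(y_te, ensemble.predict(X_te))
print(f"ensemble accuracy: {ens_acc:.3f}")
```

Swapping `VotingClassifier` for `StackingClassifier` with a simple meta-learner follows the same pattern.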

Finally, I built a deep learning classification model using TensorFlow. This model took by far the most time, as I ran into quite a few issues while tuning my base DNN. I struggled to pass class weights into the multi-class model, which led me to try various resampling methods to smooth out the imbalanced dataset. The most successful resampling method turned out to be SMOTE, which surprised me, as I had often believed SMOTE tends to add unnecessary noise to a model that can decrease accuracy. Each resampling method vastly improved the model’s ability to predict the minority class, “Enrolled,” which it otherwise struggled with. However, even with this increase in precision and recall on the “Enrolled” class, the resampling methods could not beat my base DNN’s overall accuracy, and that base DNN in turn could not beat my ensemble or CatBoostClassifier accuracy scores.
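For the class-weighting step, one common approach is to compute balanced per-class weights and pass them to Keras via `model.fit(class_weight=...)`. A minimal sketch, with an invented label distribution for illustration (the real dataset's imbalance differs):

```python
# Compute balanced class weights for an imbalanced 3-class target.
# The resulting dict is the shape Keras' model.fit(class_weight=...) expects.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 0=Dropout, 1=Enrolled (minority), 2=Graduate
y = np.array([0] * 500 + [1] * 100 + [2] * 400)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))
print(class_weight)  # the minority class receives the largest weight
```

"Balanced" weights are `n_samples / (n_classes * count(class))`, so rarer classes contribute more to the loss — an alternative to resampling methods like SMOTE.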

This project taught me a lot: simpler models can perform better, SMOTE works well for class imbalance (compared to plain over- or under-sampling), and DNNs are not always going to be an improvement over a standard ML model. That last point is especially important, since a DNN requires far more compute than a CatBoostClassifier, making it more costly for worse results in this case. This Playground Series competition gave me solid experience using different techniques to improve a classification score, which will be very worthwhile for future models and competitions.

I am extraordinarily excited for the next Playground Series competition and the chance to continue improving my machine learning and data science skills. Hopefully I can improve on this month’s ranking as I strive to become a more complete data scientist. You can find the code I wrote for this competition on my GitHub under the kaggle_comps repository, or by clicking this link HERE

#Kaggle #DataScience #MachineLearning #TensorFlow #DataAnalysis #Competition


