Explore squirrel behavior in NYC’s Central Park through ML: Clustering sightings & predicting encounters with interactive insights
NYCOpenData has a treasure trove of rich datasets on topics such as health, environment, business, and education. I stumbled upon the 2018 Central Park Squirrel Census dataset and knew immediately that I had to do something with it. The dataset records squirrel sightings collected by volunteers in Central Park over the course of two weeks. While looking through the data dictionary, I was drawn to a feature named ‘Approaches’, which denotes whether a squirrel was observed approaching a human. I thought it’d be neat to train a machine learning (ML) model to estimate whether a squirrel within the bounds of Central Park would approach me. This article walks through that weekend project and details the entire process of building the model. There’s a little bit of everything in it: geospatial data, clustering, visualization, feature engineering, unstructured text, model training, model calibration, and model deployment.
I deployed the model in a Streamlit app where you can enter your coordinates and other features, and it will tell you the probability of a squirrel approaching you. You can play with it here. Also, if you are interested in looking through some of the code, I’ve posted the .ipynb here.
The data loading was pretty standard.
import pandas as pd
ini_squirrel_df = pd.read_csv('/content/drive/MyDrive/SquirrelML/NYC_Squirrels.csv')
For the initial EDA I used dataprep to quickly get a sense of the feature distributions, cardinalities, patterns, missing data, and correlations present in the raw dataset. You can see the report here. It gave me several useful insights that helped me plan the subsequent feature engineering and remove redundant or unnecessary features. The most notable observations from this EDA were as follows:
- The dataset is comprised of…
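If you want a quick feel for the kind of per-column summary an EDA report surfaces, here is a minimal pandas sketch. It is not the dataprep report itself, just a lightweight stand-in. The toy DataFrame and the `quick_profile` helper are hypothetical, and apart from ‘Approaches’ the column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy table standing in for the raw census data.
# Only 'Approaches' is named in the article; the other columns are assumed.
toy_df = pd.DataFrame({
    "Approaches": [False, True, False, False],
    "Primary Fur Color": ["Gray", "Cinnamon", None, "Gray"],
    "Shift": ["AM", "PM", "PM", "AM"],
})


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype, missing fraction, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })


profile = quick_profile(toy_df)
print(profile)

# Share of positive labels, a rough check on class balance for the target.
approach_rate = toy_df["Approaches"].mean()
print(f"Share of sightings with an approach: {approach_rate:.2f}")
```

On the real data you would pass `ini_squirrel_df` to the helper instead of the toy table. Columns with a high missing fraction or a single distinct value are natural candidates to drop before feature engineering.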