CNNs for Imbalanced Image Classification with TensorFlow | by Lucas See | Mar 2024


Motivation:

This project started when I was exploring the Google Maps API, which offers Google Street View data through a simple API call. What struck me was the ability to quickly query and aggregate a set of street view images for almost any geography of interest. While this is an incredible capability, I quickly discovered that training an image classifier on this data would run into the issue of a highly imbalanced dataset: given the variety of scenes, any single object (dogs, cars, trees, etc.) is absent from most images. The following article covers several approaches to the issues that arise when training a model on an imbalanced dataset, broken down into four pieces:

  1. Naive CNN using Accuracy
  2. CNN using F1 Score
  3. CNN using F1 Score and class weighting
  4. Data Augmentation

Getting Started:

First, we need to gather data. To do this, Google Street View's Python-accessible API was used, and the resulting dataset is uploaded here: https://www.kaggle.com/datasets/pinstripezebra/graffiti-classification. To generate this dataset, 500 images were randomly sampled from the top 50 US cities by population.

Next, we need to load the data into a format we can work with. First, we import the packages needed for this project, save the data to the /Data-Clean/ directory, and then read it in with the tf.keras.utils.image_dataset_from_directory method. This method generates a TensorFlow dataset from a directory and labels the graffiti/non-graffiti images based on the folder they are placed in. It also lets us split the data into training and validation sets via subset="training"/"validation" and specify the image size and batch size.

Figure 1: Dataset Size
import os
import shutil
import keras
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import numpy as np
import pandas as pd
import joblib
import tensorflow_addons as tfa
import tempfile
import seaborn as sns
from os import listdir
from os.path import isfile, join
import math
from matplotlib.gridspec import GridSpec
import random

# Defining Image and Batch size
image_height, image_width = 500, 500
batch_size = 256

# Path to read data from
data_dir = './Data-Clean/'

# Defining train and validation splits using an 80/20 split
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(image_height, image_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(image_height, image_width),
    batch_size=batch_size)
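Before going further, it's worth quantifying just how skewed the data is. This snippet isn't from the original post; it's a minimal sketch assuming the ./Data-Clean/ directory contains one sub-folder per class (graffiti and no-graffiti), as described later, and it reproduces the kind of class breakdown shown in Figure 1.

# Counting images per class folder to quantify the imbalance
class_counts = {}
for class_name in sorted(os.listdir(data_dir)):
    class_path = join(data_dir, class_name)
    if os.path.isdir(class_path):
        class_counts[class_name] = len([f for f in listdir(class_path)
                                        if isfile(join(class_path, f))])
print(class_counts)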

Viewing Data:

Before we train our first model, we want to take a quick look at the data and ensure it's in the format we were expecting. This is easily done below: we extract the class_names from our training dataset, grab one batch using the .take() function, and then iterate through 9 images, adding each to a subplot as we go. Examining the output, we see 9 images that all look like they could have come from Google Street View, and all of them are labeled 'no-graffiti', which visually aligns with what we can see in the images. So far so good.

class_names = train_ds.class_names

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")
Figure 5: Sample of 9 Training Images

Initial Model Definition — Naive Approach

After loading the data, we need to define a model to train. A naive approach would be to train a basic convolutional neural network with accuracy as our evaluation metric. This approach is shown below: we define a CNN with 3 convolutional and 3 pooling layers, a sigmoid output activation, and binary cross-entropy as our loss function. We then create a model for our 500 x 500 pixel input images, train it, and save the model and its training history into two different folders. This last piece is important because training time for this model was significant (~1 hour), so we don't want to retrain it every time we run the script.

At this point our directory should look like the structure below: the Data-Clean folder with the graffiti and no-graffiti subdirectories, the history and models folders to store results, and our model_training.ipynb file in the base directory.

Figure 1: File Directory
# Function to define and compile model
def create_model(image_height, image_width, cost_function):

    model = keras.Sequential([
        keras.layers.Rescaling(1./255),
        keras.layers.Conv2D(16, (3, 3), activation='relu',
                            input_shape=(image_height, image_width, 3)),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Conv2D(32, (3, 3), activation='relu'),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Conv2D(64, (3, 3), activation='relu'),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(512, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compiling with binary cross-entropy and the supplied evaluation metric
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=[cost_function])
    return model

# Defining model parameters and creating model
image_height, image_width = 500, 500
naive_model = create_model(image_height, image_width, 'accuracy')

# Returning list of existing models - only want to train this model once
path = os.getcwd() + '/models'
file_list = [f for f in listdir(path) if isfile(join(path, f))]

# Training model if it doesn't already exist, otherwise loading
if "naive_model.pkl" not in file_list:
    naive_history = naive_model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=5
    )
    joblib.dump(naive_model, './models/naive_model.pkl')
    pd.DataFrame(naive_history.history).to_csv('./history/naive_history.csv')
else:
    naive_model = joblib.load("./models/naive_model.pkl")

After training our model, we want to evaluate its performance. Because our dataset is imbalanced, we'll do this by making predictions on the validation set and then generating a confusion matrix to evaluate performance on each class. Because our data is stored in batches, we first define a make_predictions function that iterates through each batch, makes predictions, and rounds each predicted value to 1 or 0 depending on the threshold we pass in.

The threshold we pass matters because the output of our model comes from a sigmoid activation function. As Figure 2 shows, this function produces a decimal value between 0 and 1, which we need to convert to a binary output by rounding it against our chosen threshold. In our naive approach we use a threshold of 0.5, but depending on the relative cost we place on false positives versus false negatives we can adjust the threshold accordingly.

Figure 2: Sigmoid Activation Function
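To make the thresholding concrete, here is a small illustrative snippet (not from the original post) showing how the sigmoid squashes a raw score into the 0-1 range and how different cutoffs change the resulting label.

# Illustration only: sigmoid maps any real-valued score into (0, 1)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

raw_scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
probs = sigmoid(raw_scores)  # e.g. sigmoid(0.0) == 0.5

for threshold in (0.25, 0.5, 0.75):
    print(threshold, (probs >= threshold).astype(int))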

def make_predictions(model, val_ds, threshold):

    labels = np.array([])
    predictions = np.array([])
    # Iterating through batches and making predictions
    for x, y in val_ds:
        labels = np.concatenate([labels, y.numpy()])
        predictions = np.concatenate([predictions,
                                      [i[0] for i in model.predict(x).tolist()]])

    # Converting to 1/0 depending on value relative to threshold
    predictions = [1 if i >= threshold else 0 for i in predictions]
    # Converting to int type
    predictions = np.array(predictions).astype(int)
    labels = np.array(labels).astype(int)

    return predictions, labels
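Note that plot_cm is called below but never defined in the post's snippets. A minimal stand-in using the ConfusionMatrixDisplay we imported from sklearn might look like the following; the styling of the actual figures may differ.

def plot_cm(labels, predictions, threshold):
    '''Plots a confusion matrix of true vs predicted labels
    for a given classification threshold.'''
    cm = confusion_matrix(labels, predictions)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=class_names)
    disp.plot(cmap='Blues')
    plt.title("Confusion Matrix (threshold = {})".format(threshold))
    plt.show()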

# Making predictions and plotting confusion matrix
predictions, labels = make_predictions(naive_model, val_ds, 0.5)
plot_cm(labels, predictions, 0.5)

If we run the above code, we get the confusion matrix below. Based on it, classification accuracy on our majority class is ~99.7% versus only ~59.6% for the minority class. This makes sense: since we're evaluating performance with accuracy on a highly imbalanced dataset, the model prioritizes correctly predicting the majority class because that yields the highest overall accuracy. To improve performance on the minority class, we'll have to make some changes.

Figure 3: Confusion Matrix for Naive Approach

Tweaking our Model, F1 Score

As our naive approach showed, if we use accuracy as the evaluation metric on an imbalanced dataset, we'll inevitably get a model that predicts the majority class better than the minority class. The simplest change we can make to address this is how we evaluate model performance: we'll update our model to use F1 score rather than accuracy. F1 score is the harmonic mean of precision and recall and penalizes a model that performs worse on a specific class. Note that this may decrease the total number of correct predictions our model makes, since some non-graffiti images will likely be misclassified as graffiti, but it helps if we want balanced performance across both classes.

Figure 4: F1 Score From Sklearn Metrics
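As a quick sanity check (not in the original post), we can compute precision, recall, and F1 for the naive model's validation predictions directly. This assumes graffiti is the positive class and is encoded as label 0, as the class-weight dictionary later in the article suggests.

from sklearn.metrics import f1_score, precision_score, recall_score

# Assumption: graffiti is the positive class and is encoded as label 0
precision = precision_score(labels, predictions, pos_label=0)
recall = recall_score(labels, predictions, pos_label=0)
f1_manual = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1_manual, 3))
print(round(f1_score(labels, predictions, pos_label=0), 3))  # should match f1_manual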

One additional change is that we're going to experiment with different cutoff thresholds for classifying a prediction as 0 or 1. In our previous model we used only 0.5; in our new model we'll try 0.25, 0.5, and 0.75 to assess how the different thresholds compare.

Training F1 Model

Our training approach for the F1-score model is identical to the naive model's. We call the same create_model function, the only difference being that we pass in tfa.metrics.F1Score as the evaluation metric rather than accuracy. Once the model is trained we save it to the /models folder and its training history to the /history folder; on subsequent runs we just load the model rather than retraining.

# Defining thresholds
low_threshold, medium_threshold, high_threshold = 0.25, 0.5, 0.75

# Initializing and compiling model, note we use a 0.5 threshold here
# Threshold levels will be incorporated when we make predictions
medium_threshold_model = create_model(image_height, image_width,
    tfa.metrics.F1Score(num_classes=2, threshold=0.5, average="micro"))

# Training model once
if "medium_threshold_model.pkl" not in file_list:
    history = medium_threshold_model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=5
    )
    joblib.dump(medium_threshold_model, './models/medium_threshold_model.pkl')
    # Writing history to csv
    pd.DataFrame(history.history).to_csv('./history/medium_threshold_history.csv')
else:
    medium_threshold_model = joblib.load("./models/medium_threshold_model.pkl")

Assessing Performance:

Now that we've trained our model with F1 score as the evaluation metric, we'll check performance with a slightly more comprehensive approach than before. First, we'll make predictions with this model at the three thresholds discussed previously. These thresholds will produce differing rates of false positives and false negatives for each class and let us choose the best option for our use case. Additionally, we'll graph the training and validation scores over each epoch to assess how performance changes as the model trains.

First, we'll visualize the training history of our model. The function defined below takes a dictionary of training histories and generates a graph for each one; in this case we're only passing one model's history, so only one graph will be shown.

def visualize_f1(histories):

    '''Takes a dict of keras training results, displays a grid
    of train/val F1 score over epochs'''

    graph_count = len(histories)
    rows = 1
    columns = int(math.ceil(graph_count / rows))
    fig = plt.figure(figsize=(10, 5))
    gs = GridSpec(nrows=rows, ncols=columns)
    keys = list(histories.keys())
    for model in range(len(keys)):
        ax = fig.add_subplot(gs[int(math.floor(model / columns)), model % columns])
        ax.plot(histories[model]['Results'].index, histories[model]['Results']['f1_score'], label="Train f1")
        ax.plot(histories[model]['Results'].index, histories[model]['Results']['val_f1_score'], label="Val f1")
        plt.legend(loc="upper left")
        plt.grid(True)

        # Titling each subplot with the best validation F1 score
        accuracy = round(histories[keys[model]]['Results']['val_f1_score'].max(), 4)
        plt.title("Model {graph} f1 score {accuracy}".format(graph=model,
                                                             accuracy=accuracy))

    plt.tight_layout()
    plt.show()

# Now defining data structures and calling the function
models = [medium_threshold_model]
history = [pd.read_csv('./history/medium_threshold_history.csv')]
histories = {}

# Aggregating model performance for comparison
for i in range(len(models)):
    histories[i] = {'Results': pd.DataFrame(history[i]),
                    'Model': models[i],
                    'End Score': history[i]['val_f1_score']}

# Visualizing training histories
visualize_f1(histories)

In Figure 6 below, we can see our training and validation performance across epochs; both leveled off after the second epoch. We could try training for more epochs to see whether this trend continues, but given the volume of data and the training time, this is good enough.

Figure 6: F1-Score Performance across Epoch

Next, we'll generate our confusion matrices with the three different cutoffs, 0.25, 0.5, and 0.75 (sketched below). All three do better at correctly predicting the minority class than our naive model: the naive model had 84 true positives on the minority class, whereas here the minimum is 89. We get the best overall performance with the higher 0.75 threshold, where minority-class accuracy is ~75% and majority-class accuracy is almost 100%.
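The post doesn't include the code for this comparison, but using the make_predictions and plot_cm helpers from earlier it would look roughly like this:

# Rough sketch: one confusion matrix per cutoff for the F1-trained model
for threshold in (low_threshold, medium_threshold, high_threshold):
    preds, true_labels = make_predictions(medium_threshold_model, val_ds, threshold)
    plot_cm(true_labels, preds, threshold)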

Figure 7: Confusion Matrix Comparison, F1 Score

F1 Score with Class Weighting

Using F1 score significantly increased our model's ability to correctly detect graffiti versus the baseline model (59.6% -> 75%); however, we still have an approximately 25-point gap in accuracy between the two classes. To further improve performance on the minority class, we'll now use an approach called class weighting.

In our previous two approaches, the models treated every misclassification the same regardless of which class it came from; i.e., a false positive on the graffiti class was just as bad as a false positive on the non-graffiti class. This led to models that were better at predicting the majority class than the minority. To address it, we can manually specify the weight our model gives each class during training: if we want better performance at identifying the minority class, we give that class a higher weight (importance).

This approach is shown below: first we calculate the total number of data points, then the count belonging to each class, and finally assign weights inversely proportional to each class's share of the data. Figure 8 shows exactly what this means: with 141 graffiti observations and 3,986 non-graffiti ones, our class weights are 14.63 for graffiti and 0.52 for non-graffiti. Essentially, we're weighting a graffiti observation approximately 28 times heavier than a non-graffiti one.

Figure 8: Validation Dataset breakdown

We then proceed with model training using the same approach as for the plain F1 model, the one exception being that we provide a dictionary of our calculated class weights in the model.fit() step.

# Determining the number of observations in each class
total = len(labels)
no_graffiti_count = sum(labels)
graffiti_count = total - no_graffiti_count

print("Graffiti: ", graffiti_count, " No Graffiti: ", no_graffiti_count)

# Weighting each class inversely proportional to its frequency
graffiti_weight = (1 / graffiti_count) * (total / 2.0)
no_graffiti_weight = (1 / no_graffiti_count) * (total / 2.0)

class_weight = {0: graffiti_weight, 1: no_graffiti_weight}
print('Weight for class 0: {:.2f}'.format(graffiti_weight))
print('Weight for class 1: {:.2f}'.format(no_graffiti_weight))

# Initializing and compiling a model to train with class weights
weighted_medium_model = create_model(image_height, image_width,
    tfa.metrics.F1Score(num_classes=2, threshold=0.5, average="micro"))

# Only training if the model doesn't already exist
if "weighted_medium_model.pkl" not in file_list:
    weighted_medium_model_history = weighted_medium_model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=5,
        class_weight=class_weight
    )
    joblib.dump(weighted_medium_model, './models/weighted_medium_model.pkl')
    pd.DataFrame(weighted_medium_model_history.history).to_csv('./history/weighted_medium_model_history.csv')
else:
    weighted_medium_model = joblib.load("./models/weighted_medium_model.pkl")

Evaluating Weighted F1 Score Model

Using the same approach as with our previous model, we make predictions at the 0.25, 0.5, and 0.75 classification thresholds and then generate three confusion matrices plus a training plot of model score over epochs.

# Making predictions with new models
low_weighted_predictions, low_weighted_labels = make_predictions(weighted_medium_model, val_ds, 0.25)
medium_weighted_predictions, medium_weighted_labels = make_predictions(weighted_medium_model, val_ds, 0.5)
high_weighted_predictions, high_weighted_labels = make_predictions(weighted_medium_model, val_ds, 0.75)
Figure 9: F1-Score Performance Across Epoch, Weighted Classes
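The matrices in Figure 10 below can then be generated by passing each set of predictions to the plot_cm helper sketched earlier, for example:

# Plotting one confusion matrix per threshold for the weighted model
plot_cm(low_weighted_labels, low_weighted_predictions, 0.25)
plot_cm(medium_weighted_labels, medium_weighted_predictions, 0.5)
plot_cm(high_weighted_labels, high_weighted_predictions, 0.75)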

If we compare the confusion matrices from our weighted model with the unweighted one, we can see that we increased the number of graffiti images correctly identified at the cost of misclassifying some additional non-graffiti images. The best-performing model/threshold combination appears to be the default 0.5 threshold, where we correctly identified 113 graffiti images, bringing our graffiti true-positive rate to 80.1% while decreasing the non-graffiti true-positive rate to 98%.

Figure 10: Confusion Matrix Comparison, F1 Score with Class Weighting

F1 Score With Data Augmentation

One final approach we’re going to try is data augmentation. Data augmentation is the practice of modifying the base data in some systematic way that ideally allows a model to generalize better to unseen cases. With images, this commonly involves changing the shading, rotating images, changing the zoom, etc., and can be done either prior to training to engineer a new dataset or during training by adding augmentation layers to the neural network.

We'll take the latter approach since it's more straightforward, using Keras's built-in data augmentation layers to augment images before they reach the deeper layers. We do this by defining a new function, create_model_augmentation, which is identical to our initial create_model function except for three augmentation layers that randomly flip, rotate, and adjust the contrast of each image. A complete list of built-in data augmentation layers is available here: https://www.tensorflow.org/tutorials/images/data_augmentation

def create_model_augmentation(image_height, image_width, cost_function):

    # Defining data augmentation layers
    data_augmentation = keras.Sequential([
        keras.layers.RandomFlip("horizontal_and_vertical"),
        keras.layers.RandomRotation(0.5),
        keras.layers.RandomContrast(0.5)
    ])

    # Defining model: augmentation layers run first, then the same CNN as before
    model = keras.Sequential([
        data_augmentation,
        keras.layers.Rescaling(1./255),
        keras.layers.Conv2D(16, (3, 3), activation='relu',
                            input_shape=(image_height, image_width, 3)),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Conv2D(32, (3, 3), activation='relu'),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Conv2D(64, (3, 3), activation='relu'),
        keras.layers.MaxPool2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(512, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compiling
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=[cost_function])
    return model
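Before training, it can be worth eyeballing what these layers actually do. A quick preview (not in the original post) applies the same augmentation stack to a single training image several times and plots the results:

# Sketch: preview the augmentation layers on one training image
preview_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal_and_vertical"),
    keras.layers.RandomRotation(0.5),
    keras.layers.RandomContrast(0.5)
])

plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
    for i in range(9):
        # training=True forces the random transforms to be applied
        augmented = preview_augmentation(tf.expand_dims(images[0], 0), training=True)
        plt.subplot(3, 3, i + 1)
        plt.imshow(np.clip(augmented[0].numpy(), 0, 255).astype("uint8"))
        plt.axis("off")
plt.show()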

After defining the new model, we performed the same training and evaluation process as before, resulting in the confusion matrix below. It's markedly different from the previous ones, mainly in the number of non-graffiti images incorrectly classified as graffiti; however, we also see an increase in the number of correctly predicted graffiti images. In total, with the 0.5 threshold we correctly classified 85.8% of graffiti and 83.2% of non-graffiti images, the first time performance was better on our minority class than the majority one.

Figure 11: Confusion Matrix with F1-Score and Data Augmentation

Conclusion

In this article we've compared different approaches to building a model on an imbalanced dataset; Figure 12 shows how each approach performed on the minority class, the majority class, and overall. Based on this, the naive approach clearly isn't viable if we care about accurately predicting the minority class (<60% of graffiti instances identified), and the best performer overall was either F1 score with class weighting or F1 score with data augmentation, depending on how much accuracy we're willing to give up on the majority class.

Figure 12: Model Comparison

Future Work

While we examined several options for dealing with imbalanced datasets, there are certainly others; data augmentation in particular could be explored further with different augmentation methods (flipping, rotating, etc.). Additionally, we could combine data augmentation with SMOTE (synthetic minority oversampling) to engineer a balanced dataset, using augmentation to ensure variety in our graffiti images.
