
We will use this Kaggle dataset to train our model. It contains 21,600 images of left-hand and right-hand fingers. All images are 128×128 pixels in resolution.
- Training set: 18000 images
- Test set: 3600 images
- Images are centered by the center of mass
- Noise pattern on the background
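Labels are the second-to-last character of the file name before the extension (the character at index -6 of the path, which is what the loading code reads): 0, 1, 2, 3, 4 or 5, indicating the number of fingers shown.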
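As a small illustration of the labelling rule, here is a minimal sketch that pulls the digit out of a file name. The file names and the helper name `label_from_path` are hypothetical; the sketch only assumes that the name ends with the finger count followed by one more character before the extension.

```python
import os

def label_from_path(img_path):
    """Return the finger count encoded in the file name."""
    stem, _ext = os.path.splitext(os.path.basename(img_path))
    # The stem ends with <digit><one more character>, so the digit is stem[-2].
    return int(stem[-2])

# Hypothetical file names, used only for illustration.
print(label_from_path('./fingers/train/example_3L.png'))  # 3
print(label_from_path('./fingers/test/example_0R.png'))   # 0
```

Working on the stem rather than on a fixed index of the full path keeps the rule independent of the length of the file extension.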
We will use OpenCV, an open-source library for computer vision and image processing, to read the input images and convert them into simple black-and-white images that are easier for the model to work with.
# Importing libraries
import os
import cv2
import glob
import keras
import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras.layers import Input,Dense,Conv2D,MaxPooling2D,BatchNormalization,Flatten,Dropout
# Specifying the global variables - Download the dataset in the same directory as this notebook
TRAIN_PATH = './fingers/train/*'
TEST_PATH = './fingers/test/*'
FILE_NAME = 'realtime_fingers_detection.hdf5'
MODEL_PATH = 'realtime_fingers_model/'
KERNEL = (3,3)
CLASSES = 6
IMAGE_SIZE = 128
BATCH_SIZE = 32
EPOCHS = 50
Utility functions for loading the dataset as a list of (label, img) tuples, where label is extracted from the name of the image and img is a grayscale image, read using cv2.imread and converted to a NumPy ndarray of size IMAGE_SIZE by IMAGE_SIZE. The show_images function takes any dataset of (label, img) tuples and shows the images with their labels in a grid.
def load_dataset(path):
    # The label digit sits at index -6 of the file path
    dataset = [(int(img_path[-6]), load_image(img_path)) for img_path in glob.glob(path)]
    return dataset

def load_image(path):
    # Read the image as a single-channel (grayscale) array
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.uint8)
    img = np.reshape(img, (IMAGE_SIZE, IMAGE_SIZE))
    return img

def show_images(dataset, grid_size=5):
    _, axes = plt.subplots(grid_size, grid_size, figsize=(12, 12))
    # Flatten the 2D grid of axes into a simple list
    axes = [y for x in axes for y in x]
    # Draw a random sample of grid_size**2 images with their labels
    for (i, (label, img)) in enumerate(random.sample(dataset, grid_size**2)):
        axes[i].imshow(img, cmap='gray')
        axes[i].set_title(label)
        axes[i].axis('off')
Loading the training and testing data, then checking the training data by displaying a sample of the images:
training_set = load_dataset(TRAIN_PATH)
testing_set = load_dataset(TEST_PATH)
show_images(training_set)
Output: (figure: a 5×5 grid of randomly sampled grayscale training images with their labels)
For real-time inference, we are going to convert the video feed from the webcam to a black-and-white image and then pass it to the model. The model therefore has to be trained on the same kind of data, so we now convert our grayscale images to black-and-white masks.
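The core of this conversion is a binary threshold. As a rough sketch of the idea in plain NumPy (not the OpenCV implementation; the helper name `binary_threshold` is made up for this example): every pixel strictly brighter than the threshold becomes the maximum value, and everything else becomes 0.

```python
import numpy as np

def binary_threshold(img, thresh=80, maxval=255):
    """Pixels strictly above `thresh` become `maxval`; all others become 0."""
    return np.where(img > thresh, maxval, 0).astype(np.uint8)

gray = np.array([[10, 80, 81],
                 [200, 79, 255]], dtype=np.uint8)
print(binary_threshold(gray))
# [[  0   0 255]
#  [255   0 255]]
```

Note that a pixel exactly equal to the threshold (80 here) ends up black, which matches the behaviour of cv2.THRESH_BINARY.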
def process_image(img, thresh_low=80, thresh_high=255):
    img = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE))
    # Blur slightly to suppress background noise before thresholding
    img = cv2.GaussianBlur(img, (5, 5), 0)
    _, img = cv2.threshold(img, thresh_low, thresh_high, cv2.THRESH_BINARY)
    # Flood-fill a copy of the image starting from the top-left corner
    im_floodfill = img.copy()
    h, w = img.shape[:2]
    mask = np.zeros((h+2, w+2), np.uint8)
    cv2.floodFill(im_floodfill, mask, (0,0), 255)
    # Invert the flood-filled copy and combine it with the thresholded image
    im_floodfill_inv = cv2.bitwise_not(im_floodfill)
    img = img | im_floodfill_inv
    # Scale to the range [0, 1] and add a channel dimension
    img = img/255
    img = np.reshape(img, (IMAGE_SIZE, IMAGE_SIZE, 1))
    return img

def process_dataset(dataset):
    dataset = [(label, process_image(img)) for (label, img) in dataset]
    return dataset
training_set = process_dataset(training_set)
testing_set = process_dataset(testing_set)
show_images(training_set)
Output: (figure: a 5×5 grid of randomly sampled training images after conversion to black-and-white masks)
Now we split the data into X_train, Y_train, X_test and Y_test, taken from training_set and testing_set respectively.
X_train = np.array([img for (_, img) in training_set])
Y_train = keras.utils.to_categorical([label for (label, _) in training_set], num_classes=CLASSES)
X_test = np.array([img for (_, img) in testing_set])
Y_test = keras.utils.to_categorical([label for (label, _) in testing_set], num_classes=CLASSES)

print(X_train.shape)
print(Y_train.shape)
Output:
(18000, 128, 128, 1)
(18000, 6)
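The second shape, (18000, 6), comes from one-hot encoding: each integer label becomes a row of six values with a single 1 in the position of its class. A minimal NumPy sketch of the same idea (the helper `one_hot` is an illustrative stand-in, not the Keras function itself):

```python
import numpy as np

def one_hot(labels, num_classes=6):
    """Turn integer labels into one-hot rows, one column per class."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([0, 3, 5])
print(encoded.shape)  # (3, 6)
print(encoded[1])     # [0. 0. 0. 1. 0. 0.]
```

This is the format that the categorical cross-entropy loss used later expects for the targets.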
We generate modified data using the Keras ImageDataGenerator class. It creates a data generator that augments the images on the fly during training (here: random rotation, zoom, small shifts and shear, with horizontal flipping disabled), potentially improving model robustness.
img_generator = keras.preprocessing.image.ImageDataGenerator(
    rotation_range=45,
    zoom_range=0.2,
    width_shift_range=0.05,
    height_shift_range=0.05,
    shear_range=0.1,
    horizontal_flip=False,
    fill_mode="nearest"
)
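To get a feel for what a small shift with fill_mode="nearest" does, here is a simplified NumPy sketch. It is not how Keras implements augmentation internally; the helper `shift_right_nearest` is hypothetical and only handles a shift to the right, filling the vacated columns by repeating the edge column.

```python
import numpy as np

def shift_right_nearest(img, pixels):
    """Shift a 2-D image right by `pixels` columns, repeating the edge column."""
    if pixels <= 0:
        return img.copy()
    width = img.shape[1]
    padded = np.pad(img, ((0, 0), (pixels, 0)), mode='edge')
    return padded[:, :width]

IMAGE_SIZE = 128
max_shift = int(0.05 * IMAGE_SIZE)   # a 0.05 shift range is about 6 pixels
row = np.array([[1, 2, 3, 4]])
print(max_shift)                     # 6
print(shift_right_nearest(row, 1))   # [[1 1 2 3]]
```

With 128-pixel images, a shift range of 0.05 therefore moves the hand by at most a few pixels, which keeps it roughly centered as in the original dataset.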
The code below defines a CNN for image classification. The model consists of several layers stacked sequentially:
1. Input Layer: Input(input_shape) defines the input layer that receives images of the specified shape [IMAGE_SIZE, IMAGE_SIZE, 1]. The channel depth is 1, which signifies single-channel images. For RGB images, this value would be 3.
2. Convolutional Layers: Conv2D(filters, kernel_size, strides=(1, 1), activation='relu') defines a convolutional layer with the following parameters:
- filters: The number of filters used in the convolution operation (here, 64, 128, 256, and 64 at different stages). Each filter learns to detect specific features in the image.
- kernel_size: The size of the filter (3, 3). It determines the extent of the image neighborhood the filter slides across to detect features.
- strides: The step size of the filter (1, 1 in this case), indicating it moves one pixel at a time.
- activation='relu': Applies the ReLU (Rectified Linear Unit) activation function, which introduces non-linearity. It essentially sets negative values to zero.
3. Pooling Layers: MaxPooling2D((2, 2)) applies max pooling with a window size of 2×2. It downsamples the feature maps by taking the maximum value from each 2×2 block, reducing dimensionality and potentially improving model robustness to slight shifts.
4. BatchNormalization Layers: BatchNormalization() normalizes the activations of the previous layer across the mini-batch, improving training stability and speed.
5. Dropout Layer: Dropout(0.2) introduces dropout, a regularization technique that randomly drops 20% of the activations during training. This helps prevent overfitting by forcing the network not to rely too much on any specific feature.
6. Flatten Layer: Flatten() flattens the multi-dimensional output from the convolutional layers into a 1D vector suitable for feeding into the fully connected layer.
7. Fully Connected Layer: Dense(CLASSES, activation='softmax') defines a fully connected layer with the number of neurons equal to the number of image classes (CLASSES). The softmax activation function converts the outputs into probability scores for each class.
8. Model Compilation: model.compile() is run before training starts. It checks for format errors and defines the loss function, the optimizer (including its learning rate), and the metrics. A compiled model is needed for training but is not necessary for predicting.
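To see how the stack described above shrinks a 128×128 input, the following sketch traces the feature-map size through the four convolution and pooling stages. It assumes the Keras defaults for these layers (no padding in Conv2D, and pooling with a stride equal to the window size); the function name `trace_shapes` is made up for the example.

```python
def trace_shapes(size=128, filters=(64, 128, 256, 64), kernel=3, pool=2):
    """Print the feature-map size after each Conv2D/MaxPooling2D pair."""
    for f in filters:
        size = size - kernel + 1   # unpadded convolution, stride 1
        size = size // pool        # 2x2 max pooling (floor division)
        print(f"after conv({f}) + pool: {size}x{size}x{f}")
    return size * size * filters[-1]

flat = trace_shapes()
print("flattened length:", flat)           # 2304
print("dense parameters:", flat * 6 + 6)   # 13830
```

Batch normalization and dropout do not change the shape, so the final Dense layer sees a vector of 6 × 6 × 64 = 2304 values.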
def create_model():
    input_shape = [IMAGE_SIZE, IMAGE_SIZE, 1]

    model = keras.models.Sequential([
        Input(input_shape),
        Conv2D(64, KERNEL, strides=(1, 1), activation='relu'),
        MaxPooling2D((2, 2)),
        BatchNormalization(),
        Conv2D(128, KERNEL, strides=(1, 1), activation='relu'),
        MaxPooling2D((2, 2)),
        BatchNormalization(),
        Conv2D(256, KERNEL, strides=(1, 1), activation='relu'),
        MaxPooling2D((2, 2)),
        BatchNormalization(),
        Conv2D(64, KERNEL, strides=(1, 1), activation='relu'),
        MaxPooling2D((2, 2)),
        BatchNormalization(),
        Dropout(0.2),
        Flatten(),
        Dense(CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = create_model()
model.summary()
Output:
If you already have a trained model, skip this step and go to Step 6. This step trains the model and saves it locally on your machine. Training the model completely took me around 2.5 hours. I have uploaded my trained model to the GitHub repo.
The code below fits the CNN model on the training data produced by img_generator, using batches of size BATCH_SIZE and iterating for EPOCHS epochs. It also monitors the validation data during training.
The callbacks_list contains helpers that monitor and potentially adjust the training process:
- ModelCheckpoint: Saves the best model based on the training loss.
- ReduceLROnPlateau: Reduces the learning rate if the training loss stagnates. Note that min_lr is set to 0.0001, which equals the optimizer's initial learning rate, so with these settings the rate cannot actually drop.
- EarlyStopping: Stops training early if the validation loss doesn't improve for 3 consecutive epochs, preventing overfitting.
checkpoint = ModelCheckpoint(FILE_NAME, monitor='loss', verbose=1, save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='loss', verbose=1, factor=0.5, patience=1, min_lr=0.0001, mode='min')
earlyStopping = EarlyStopping(monitor='val_loss', verbose=1, min_delta=0, restore_best_weights=True, patience=3, mode='min')
callbacks_list = [checkpoint, earlyStopping, reduce_lr]

history = model.fit(
    x=img_generator.flow(X_train, Y_train, batch_size=BATCH_SIZE),
    steps_per_epoch=X_train.shape[0] // BATCH_SIZE,
    validation_data=img_generator.flow(X_test, Y_test, batch_size=BATCH_SIZE),
    validation_steps=X_test.shape[0] // BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=callbacks_list
)
Output of ending epochs and EarlyStopping:
...
Epoch 18/50
562/562 [==============================] - 642s 1s/step - loss: 0.0107 - accuracy: 0.9979 - val_loss: 0.0088 - val_accuracy: 0.9992 - lr: 1.0000e-04
Epoch 19/50
562/562 [==============================] - 642s 1s/step - loss: 0.0099 - accuracy: 0.9979 - val_loss: 0.0120 - val_accuracy: 0.9975 - lr: 1.0000e-04
Epoch 20/50
562/562 [==============================] - ETA: 0s - loss: 0.0096 - accuracy: 0.9979
Restoring model weights from the end of the best epoch: 17.
562/562 [==============================] - 616s 1s/step - loss: 0.0096 - accuracy: 0.9979 - val_loss: 0.0101 - val_accuracy: 0.9980 - lr: 1.0000e-04
Epoch 20: early stopping
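The log shows training stopping at epoch 20 with the weights of epoch 17 restored, which is what a patience of 3 produces: three epochs in a row without a new best validation loss. The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras source, and the loss values in the example are invented.

```python
def early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best, best_epoch, wait = float('inf'), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:           # min_delta=0: any decrease counts as improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, not the values from the run above.
print(early_stopping([0.50, 0.40, 0.30, 0.35, 0.32, 0.31]))  # (6, 3)
```

In the example the best loss occurs at epoch 3 and training stops three epochs later, the same pattern as epochs 17 and 20 in the log.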
Saving the model locally. This command saves the weights, variables and assets of the model in the directory MODEL_PATH:
tf.saved_model.save(model, MODEL_PATH)
If you already have a trained model use the command below:
model.load_weights(MODEL_PATH)
Now we have a trained model. To evaluate our model’s accuracy and loss we run the following block of code.
loss, acc = model.evaluate(X_test, Y_test)
Output:
113/113 [==============================] - 28s 245ms/step - loss: 0.0088 - accuracy: 0.9997
So our model has an accuracy of 99.97% on the test dataset, with a loss of 0.0088.
Plotting the metrics. (This code only works if you trained the model in Step 5. If you skipped Step 5, there is no history of the training process and the graph cannot be plotted.)
plt.figure(figsize=(16, 6))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Loss and Accuracy')
plt.ylabel('Value')
plt.xlabel('Epoch')
plt.legend(['Train Loss', 'Validation Loss', "Accuracy", "Validation Accuracy"], loc='upper left')
plt.savefig('training.png')
plt.show()
Creating a class to get the number of fingers from the input video stream
class FingerClassifier(object):
    def __init__(self, model_object):
        self.detect = model_object

    def get_classification(self, img):
        # Add a batch dimension and convert to a float tensor
        img = img.reshape(1, *img.shape)
        img = tf.constant(img, dtype=float)
        unique, counts = np.unique(img, return_counts=True)
        # When number of white pixels is less than 1200 we return -1
        if (len(counts) <= 1 or counts[1] < 1200):
            return -1
        # Detect will return a tensor array which contains the probability
        # of all the 6 classifications for the given image and we take the
        # argument with the highest probability
        output = self.detect(img)
        return np.argmax(output)

obj = FingerClassifier(model)
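The two decisions made inside get_classification, rejecting nearly empty masks and picking the most probable class, can be tried out without a trained network. The sketch below uses a hypothetical `fake_model` function in place of the CNN and counts the white pixels directly instead of going through np.unique, so it is a variation on the class above rather than a copy of it.

```python
import numpy as np

MIN_WHITE_PIXELS = 1200  # same cut-off as in get_classification

def classify_mask(mask, predict_fn):
    """Return -1 for a nearly empty mask, else the most probable class index."""
    if np.count_nonzero(mask == 1) < MIN_WHITE_PIXELS:
        return -1
    probs = predict_fn(mask[np.newaxis, ...])   # add a batch dimension
    return int(np.argmax(probs))

def fake_model(batch):
    # Hypothetical stand-in for the trained CNN: always favours class 2.
    return np.array([[0.05, 0.05, 0.80, 0.04, 0.03, 0.03]])

empty = np.zeros((128, 128, 1))
hand = np.zeros((128, 128, 1))
hand[30:90, 40:80] = 1                   # 60 * 40 = 2400 white pixels
print(classify_mask(empty, fake_model))  # -1
print(classify_mask(hand, fake_model))   # 2
```

Returning -1 for masks with too few white pixels avoids reporting a finger count when no hand is in view.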
The function below pads the input video frame with a constant border (pixel value 255 in the code) so that it resembles the training images more closely, which gives better results.
def process_stream_image(img):
    # Creating a border around the img to make the input similar to the training images
    img = cv2.copyMakeBorder(img.copy(), 50, 50, 50, 50, cv2.BORDER_CONSTANT, value=[255, 255, 255])
    img = process_image(img, thresh_low=70, thresh_high=250)
    # Setting all values which are not 1, equal to 0
    img[img < 1] = 0
    return img
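The effect of the border on the image size is easy to check in NumPy. The sketch below uses np.pad as a stand-in for cv2.copyMakeBorder on a single-channel image; the helper name `add_border` is made up for this example.

```python
import numpy as np

def add_border(img, width=50, value=255):
    """Pad a 2-D grayscale image on all four sides with a constant value."""
    return np.pad(img, width, mode='constant', constant_values=value)

crop = np.zeros((250, 250), dtype=np.uint8)   # a 250x250 single-channel crop
padded = add_border(crop)
print(padded.shape)     # (350, 350)
print(padded[0, 0])     # 255
print(padded[50, 50])   # 0
```

A 250×250 crop therefore becomes 350×350 before process_image resizes it back down to IMAGE_SIZE, so the hand occupies a smaller, more central part of the final image.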
Finally, we're ready to bring it all together for real-time finger detection! We'll use OpenCV to capture video frames and make predictions using our trained CNN model. To optimize processing speed, we'll focus on a specific region of the camera feed instead of the entire frame. By reducing the number of pixels analyzed, we can achieve faster processing and potentially improve accuracy.
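One practical detail when cropping a region from a frame: NumPy slicing silently clips a slice that runs past the edge of the array, so the crop can come out smaller than requested. The sketch below checks a region given as (x, y, w, h) against a frame size. The 640×480 and 1280×720 resolutions are assumptions for illustration (your webcam may differ), and `clamp_roi` is a hypothetical helper.

```python
def clamp_roi(roi, frame_width, frame_height):
    """Clip an (x, y, w, h) region so that it lies inside the frame."""
    x, y, w, h = roi
    x, y = max(0, x), max(0, y)
    w = max(0, min(w, frame_width - x))
    h = max(0, min(h, frame_height - y))
    return x, y, w, h

# Assumed frame sizes; other cameras will differ.
print(clamp_roi((400, 120, 250, 250), 640, 480))    # (400, 120, 240, 250)
print(clamp_roi((400, 120, 250, 250), 1280, 720))   # (400, 120, 250, 250)
```

If the clamped width or height differs from what you asked for, it may be worth moving the region so that the whole hand stays inside the frame.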
camMarginX = 10
camMarginY = 10
scale = 10
# The Upper & Lower bound decides which color range to be taken
LB = np.array([0, 90, 0])
UB = np.array([180, 220, 255])
# This is the cropped area of the video where the hand needs to be shown
roi = (400, 120, 250, 250)

cam = cv2.VideoCapture(0)
rval = True
while rval:
    # img contains the current frame
    rval, img = cam.read()
    if not rval:
        # No frame could be read from the camera, so stop the loop
        break
    img = cv2.flip(img, 1)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # cropped is the region in focus
    x, y, w, h = roi
    cropped = img[y:y+h, x:x+w]
    cv2.imshow('Original', cropped)
    # After processing: The image is now converted to only black and white
    cropped = process_stream_image(cropped)
    cv2.imshow('Mask', cropped)
    fingers = obj.get_classification(cropped)
    print(fingers, end=' ')
    # Select the Original window and press Esc key to close the program
    if (cv2.waitKey(25) & 0xFF == 27):
        break

cam.release()
cv2.destroyAllWindows()
Output:
Congratulations! You’ve successfully built a custom CNN for real-time finger detection. This is a significant accomplishment, equipping you with a powerful tool for various computer vision applications.
Remember, this is just the beginning. You can further enhance your model by:
- Experimenting with different architectures and hyperparameters.
- Trying out more sophisticated techniques like transfer learning with pre-trained models.
- Exploring various data augmentation methods to improve model robustness.
As you dive deeper into the world of CNNs, keep exploring, keep learning, and keep building!
You can find my full code here
Thank you and Keep Building!!