TinyML — Convolutional Neural Networks (CNN)


From mathematical foundations to edge implementation

👨🏽‍💻 Github: thommaskevin/TinyML (github.com)
👷🏾 Linkedin: Thommas Kevin | LinkedIn
📽 Youtube: Thommas Kevin — YouTube
👨🏻‍🏫 Research group: Conecta.ai (ufrn.br)

SUMMARY

1 — Convolutional Neural Networks History
2 — Convolutional Neural Networks Theory
2.1 — Convolution Layer
2.2 — Padding Layer
2.3 — Pooling Layer
2.4 — Flattening Layer
3 — TinyML Implementation

1 — Convolutional Neural Networks History

The concept of Convolutional Neural Networks (CNN) began to take shape in the 1980s with the work of Kunihiko Fukushima, who developed the Neocognitron. Inspired by the structure of the visual system in animals, the Neocognitron had a hierarchical architecture capable of learning to recognize visual patterns through a process of self-organization. This work was a significant precursor to the development of modern CNNs.

Kunihiko Fukushima

The modern architecture of CNNs was proposed by Yann LeCun and his colleagues in the late 1980s and early 1990s. They developed LeNet-5, a convolutional neural network designed for recognizing handwritten digits in the MNIST dataset. LeNet-5 consisted of several convolutional layers followed by pooling layers and fully connected layers, establishing the basis for the architecture of CNNs used today.

Yann LeCun

Despite initial success, the use of CNNs was limited by computational constraints and the lack of large labeled datasets. However, as computing power increased and techniques like training deep networks with GPUs (graphics processing units) became feasible, CNNs began to gain more attention. Furthermore, the development of large labeled image databases, such as ImageNet, provided the necessary material to train deep networks effectively.

A breakthrough occurred in 2012 when a CNN called AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a significant margin over competitors. AlexNet utilized several convolutional layers, ReLU activation functions, and regularization techniques such as dropout, demonstrating the power of CNNs for visual recognition tasks. From that point on, CNNs became the primary tool for a wide range of applications, leading to continuous innovations. Models like VGGNet, GoogLeNet (Inception), and ResNet introduced new architectures and techniques to improve network depth, efficiency, and accuracy.

Ilya Sutskever (left), Alex Krizhevsky (centre), Geoffrey Hinton (right)

Today, CNNs are an essential component of many artificial intelligence systems. They are used not only in image recognition but also in video analysis, natural language processing, medical diagnostics, autonomous vehicles, and many other areas. Research continues to advance, with innovations including more efficient convolutional neural networks, deep neural networks (DNNs), and generative adversarial networks (GANs), among others.

2 — Convolutional Neural Networks Theory

In mathematics, “convolution” is an integral operation that expresses how the shape of one function is modified by another. Within the context of neural networks, the term is used more loosely: the operation applied in practice is a sliding dot product (technically a cross-correlation) rather than the classical integral.

Fundamentally, we start with an input, typically an image in our scenario. We then introduce a filter (also known as a kernel) that slides over the image; at each position, a dot product between the filter and the underlying image patch is computed, an operation commonly termed “convolution.” Applying these filters to the input image produces outputs referred to as “feature maps”.

2.1 — Convolution Layer

The convolutional layer is where the image is processed to detect patterns and generate feature maps through filters (kernels). Each feature map represents one attribute that its filter aims to identify. Filters are usually small matrices, typically (3×3) or (5×5), and each filter is overlaid on an equally sized region of the input image, where a dot product is computed. The filter is then shifted one cell horizontally and the operation is repeated; when it reaches the right edge, the filter moves one cell down and sweeps horizontally again. The results are written to the output in the same order, forming the feature map.

Convolution layer process

Suppose we have an input image described by a tensor I with dimensions m1 x m2 x mc, where m1 and m2 are the spatial dimensions (height and width) and mc is the number of channels.

We apply a filter K, which is also a tensor, with dimensions (n1 x n2 x nc), where nc is the number of channels and matches that of the input image. This filter slides across the image from left to right and top to bottom, performing element-wise multiplication with the corresponding region of the input tensor I and summing up those products. The stride parameter determines the step size by which the filter moves across the image. With a stride of 1, the result of this operation between I and K is another tensor with dimensions (m1 − n1 + 1) x (m2 − n2 + 1) x 1.

The (i, j)-th entry of the feature map is calculated as follows:

feature_map(i, j) = sum over a, b, c of K(a, b, c) · I(i + a − 1, j + b − 1, c), with a = 1, …, n1, b = 1, …, n2, and c = 1, …, nc.

We’ve chosen the following example, where a 5x5x1 image is convolved with a 3x3x1 kernel and a stride of s = 1 is employed.

For a single channel, the formula reduces to:

feature_map(i, j) = sum over a, b of K(a, b) · I(i + a − 1, j + b − 1), with a = 1, …, n1 and b = 1, …, n2.

To compute the (1, 1)-th entry of the feature map in the above example, the kernel is placed over the top-left region of the image and the weighted entries are summed; entries that are not available (i.e., that fall outside the image borders) are substituted with zero.

Similarly, the remaining entries can be calculated using the same formula. This process is repeated with different filters, each capturing a different feature of the image, such as edges, blur, or sharpness. The number of filters can be greater than one; each filter produces its own feature map, and the maps are stacked along the depth dimension of the output.
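As a minimal sketch of this operation (assuming a single-channel input, no padding, and a plain NumPy implementation; the helper name convolve2d is just for illustration), the feature map can be computed with two nested loops:

import numpy as np

def convolve2d(image, kernel, stride=1):
    # "valid" convolution (cross-correlation) of a single-channel image with a kernel
    m1, m2 = image.shape
    n1, n2 = kernel.shape
    out_h = (m1 - n1) // stride + 1
    out_w = (m2 - n2) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise product of the kernel with the covered region, then sum
            region = image[i * stride:i * stride + n1, j * stride:j * stride + n2]
            feature_map[i, j] = np.sum(region * kernel)
    return feature_map

image = np.arange(64, dtype=float).reshape(8, 8)  # toy 8x8 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                # simple vertical-edge filter
print(convolve2d(image, kernel).shape)            # (6, 6) = (8 - 3 + 1) x (8 - 3 + 1)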

2.2 — Padding Layer

A basic convolution on a grayscale image of size (n x n) with a filter/kernel of size (f x f) produces an output of size (n − f + 1) x (n − f + 1). For instance, convolving an (8 x 8) image with a (3 x 3) filter yields a (6 x 6) output. This shrinking happens at every layer, so the output of a layer is typically smaller than its input. In addition, pixels at the edges and corners are covered by the filter far fewer times than central pixels, so their information contributes less to the output. Padding addresses both issues.

There are several types of padding commonly used in machine learning (a short code sketch comparing “same” and “valid” padding follows the list):

  1. Same Padding: Same padding involves adding extra elements, typically zeros, to the outer frame of the original image. By expanding the input in this manner, the filter is able to scan a larger area, ensuring that the output images remain the same size as the original. This is particularly useful for maintaining spatial dimensions during convolution operations.
  2. Valid Padding: In contrast to same padding, valid padding does not involve adding any extra elements to the image. The filter scans only the original image. While this may lead to some loss of border information, valid padding is often used when it is desirable to reduce the size of the output feature map. This reduction can help decrease the number of parameters in the model and improve computational efficiency.
  3. Causal Padding: Causal padding is primarily used in sequence-to-sequence models and time series forecasting, particularly with one-dimensional convolutional layers. This type of padding adds elements to the beginning of the data sequence, enabling the algorithm to forecast values for early time steps. By incorporating past and present data for prediction, causal padding ensures that the model does not utilize future data, which may not be available during inference.
  4. Full Padding: This type of padding involves adding more than one layer of zeros around the borders of the input, resulting in an output feature map that is larger than the original image size. Full padding is less common but can be used in certain scenarios where a larger output size is desired.
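As a rough illustration of how “same” and “valid” padding affect the output size (a sketch using Keras layers on an 8×8 single-channel input, matching the example above; note that Keras exposes causal padding only for 1D convolutions and does not provide full padding for Conv2D):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 8, 8, 1))                    # one 8x8 grayscale image

same = layers.Conv2D(1, (3, 3), padding='same')(x)    # zeros added around the border
valid = layers.Conv2D(1, (3, 3), padding='valid')(x)  # no padding

print(same.shape)   # (1, 8, 8, 1) -> spatial size preserved
print(valid.shape)  # (1, 6, 6, 1) -> (8 - 3 + 1) x (8 - 3 + 1)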

2.3 — Pooling Layer

In the pooling layer, the spatial dimensions of the convolved features are reduced, which helps extract the dominant features from the input image. This reduction is achieved by applying a pooling function to the output of the convolutional layer. Let’s assume the input to the pooling layer has dimensions m1 x m2 x mc, the pooling window has size (n x n), and the stride is s.

The dimensions of the pooled output are then:

(floor((m1 − n)/s) + 1) x (floor((m2 − n)/s) + 1) x mc

Three common types of pooling in deep learning are:

Average Pooling: The average of the pixel values in the field covered by the pooling window is passed to the output matrix.

Max Pooling: The highest of the pixel values in the covered field is passed to the output matrix.

Difference between Max Pooling and Average Pooling

Global Max Pooling: The highest pixel value across the entire input is passed to the output matrix. In this type of pooling, the pool size is equal to the input size.

Difference between Max Pooling and Global Max Pooling

Other aggregation functions, such as sum pooling, can also be used. An example of max pooling is given below: pooling is applied over 2×2 patches, and from each patch the maximum entry is selected.
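A minimal sketch of 2×2 max pooling and average pooling on a toy 4×4 input (the values below are purely illustrative):

import numpy as np
from tensorflow.keras import layers

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]], dtype=np.float32).reshape(1, 4, 4, 1)

max_pool = layers.MaxPooling2D(pool_size=(2, 2))(x)      # keep the maximum of each 2x2 patch
avg_pool = layers.AveragePooling2D(pool_size=(2, 2))(x)   # keep the average of each 2x2 patch

print(max_pool.numpy().reshape(2, 2))   # [[6. 8.]
                                        #  [3. 4.]]
print(avg_pool.numpy().reshape(2, 2))   # [[3.75 5.25]
                                        #  [2.   2.  ]]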

2.4 — Flattening Layer

The flattening layer is a crucial component in a neural network architecture, especially in the transition from convolutional layers to fully connected layers. This layer essentially converts the multi-dimensional feature maps generated by the convolutional and pooling layers into a one-dimensional vector, which can be fed into the subsequent fully connected layers for classification or regression tasks.

Here’s how the flattening layer works:

  1. Input: The input to the flattening layer is typically a multi-dimensional tensor representing the feature maps generated by the previous convolutional or pooling layers. For example, if the last convolutional or pooling layer produces feature maps of dimensions (height, width, depth), the input tensor would have the shape (batch_size, height, width, depth).
  2. Flattening: The flattening layer reshapes the input tensor into a one-dimensional vector by simply concatenating all the elements of the feature maps along a single dimension. For example, if the feature maps have dimensions (height, width, depth), the flattening layer would reshape them into a vector of length height * width * depth.
  3. Output: The output of the flattening layer is a one-dimensional vector that represents the flattened feature maps. This vector can then be passed as input to the subsequent fully connected layers.

The purpose of the flattening layer is to transform the spatial information captured in the feature maps into a format that can be processed by fully connected layers, which require one-dimensional input vectors. By flattening the feature maps, the neural network can effectively learn complex patterns and relationships in the data across different spatial locations, leading to more accurate predictions.

Suppose we have a set of feature maps 𝐹 generated by the previous convolutional or pooling layers. Let’s denote the dimensions of these feature maps as follows:

  • 𝐻: Height of the feature maps
  • 𝑊: Width of the feature maps
  • 𝐷: Depth (number of channels) of the feature maps
  • 𝐵: Batch size (number of samples in the batch)

So, the shape of the feature maps 𝐹 would be (B,H,W,D), where 𝐵 represents the batch size.

To flatten these feature maps into a one-dimensional vector, we simply reshape them into a vector of length 𝐻×𝑊×𝐷. This can be expressed mathematically as:

Flatten(𝐹)=reshape(𝐹,(𝐵,𝐻×𝑊×𝐷))

Here, the reshape operation reshapes the (𝐵,𝐻,𝑊,𝐷) tensor into a (𝐵,𝐻×𝑊×𝐷) tensor, effectively flattening the spatial dimensions into a single dimension.

For example, if 𝐹 has dimensions (4,5,5,3) (batch size of 4, feature maps with height 5, width 5, and depth 3), then the flattened output would have dimensions (4,75), where each row represents a flattened feature map for one sample in the batch.

This flattened vector can then be passed as input to the subsequent fully connected layers in the neural network.
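A short sketch of this step for the (4, 5, 5, 3) example above, using either a plain NumPy reshape or the equivalent Keras Flatten layer:

import numpy as np
from tensorflow.keras import layers

F = np.random.rand(4, 5, 5, 3).astype(np.float32)   # batch of 4 feature maps, each 5x5x3

flat_np = F.reshape(4, 5 * 5 * 3)                    # plain reshape
flat_keras = layers.Flatten()(F)                     # Keras layer used in the model below

print(flat_np.shape)      # (4, 75)
print(flat_keras.shape)   # (4, 75)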

3 — TinyML Implementation

With this example, you can implement the machine learning algorithm on an ESP32, Arduino, Raspberry Pi, and other microcontrollers or IoT devices.

3.0 — Install the libraries listed in the requirements.txt file

!pip install -r requirements.txt

3.1 — Importing libraries

import numpy as np
from sklearn.datasets import load_digits
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import time
import seaborn as sns
import os

3.2 — Load Dataset

MNIST, short for Modified National Institute of Standards and Technology database, is a widely used dataset in machine learning and computer vision: 70,000 grayscale images of handwritten digits from 0 to 9, each 28×28 pixels. To keep the model small enough for a microcontroller, this example instead uses scikit-learn’s digits dataset (load_digits), a smaller relative of MNIST with 1,797 grayscale digit images of only 8×8 pixels.

link: https://www.nist.gov/itl/products-and-services/emnist-dataset
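A quick sanity check of the dataset used in this example (a sketch; load_digits ships with scikit-learn, so no download is needed):

from sklearn.datasets import load_digits

x_values, y_values = load_digits(return_X_y=True)
print(x_values.shape)   # (1797, 64) -> 1797 images with 8x8 = 64 pixels each
print(y_values.shape)   # (1797,)    -> one label (0-9) per image
print(x_values.max())   # 16.0       -> pixel intensities range from 0 to 16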

def get_data():
    np.random.seed(1337)
    x_values, y_values = load_digits(return_X_y=True)
    x_values /= x_values.max()
    # reshape to (8 x 8 x 1)
    x_values = x_values.reshape((len(x_values), 8, 8, 1))
    # split into train (60%), test (20%), validation (20%)
    TRAIN_SPLIT = int(0.6 * len(x_values))
    TEST_SPLIT = int(0.2 * len(x_values) + TRAIN_SPLIT)
    x_train, x_test, x_validate = np.split(x_values, [TRAIN_SPLIT, TEST_SPLIT])
    y_train, y_test, y_validate = np.split(y_values, [TRAIN_SPLIT, TEST_SPLIT])

    return x_train, x_test, x_validate, y_train, y_test, y_validate

3.3 — Splitting the data

X_train, X_test, X_validate, y_train, y_test, y_validate = get_data()

3.4 — Exploratory Data Analysis

X_train__ = X_train.reshape(X_train.shape[0], 8, 8)

fig, axis = plt.subplots(1, 4, figsize=(20, 10))
for i, ax in enumerate(axis.flat):
    ax.imshow(X_train__[i], cmap='binary')
    digit = y_train[i]
    ax.set(title=f"Real Number is {digit}")

3.5 — Define the model

model = tf.keras.Sequential()
model.add(layers.Conv2D(8, (3, 3), activation='relu', input_shape=(8, 8, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(len(np.unique(y_train))))
model.summary()
plot_model(model, to_file='./figures/model.png')

3.6 — Compile the model

model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
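Because the final Dense layer has no activation, the model outputs raw logits, which is why the loss is built with from_logits=True. If you need class probabilities at inference time, you can apply a softmax afterwards, for example:

probs = tf.nn.softmax(model.predict(X_test), axis=-1)  # convert logits to probabilities
print(probs[0].numpy().round(3))                       # probabilities for the first test image, summing to ~1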

3.7 — Training the model

history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=16,
                    validation_data=(X_validate, y_validate))

model.save('./models/model.keras')

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'r.', label='Training loss')
plt.plot(epochs, val_loss, 'y', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.grid()
plt.legend()
plt.savefig('./figures/history_training.png', dpi=300, bbox_inches='tight')
plt.show()

3.8 — Model Evaluation

3.8.1 — Test Data

def test_model(model, x_test, y_test):
    x_test = (x_test / x_test.max()).reshape((len(x_test), 8, 8, 1))
    y_pred = model.predict(x_test).argmax(axis=1)
    print('ACCURACY', ((y_pred == y_test).sum() / len(y_test)) * 100, "%")

test_model(model, X_test, y_test)

3.8.2 — Confusion matrix

fig = plt.figure(figsize=(10, 10))  # Set figure

y_pred = model.predict(X_test)          # Predict scores (logits) for each class
Y_pred = np.argmax(y_pred, 1)           # Decode predicted labels
mat = confusion_matrix(y_test, Y_pred)  # Confusion matrix

# Plot confusion matrix
sns.heatmap(mat.T, square=True, annot=True, cbar=False, cmap=plt.cm.Blues, fmt='.0f',
            xticklabels=np.unique(y_test), yticklabels=np.unique(y_test),
            annot_kws={"fontsize": 14}, linewidths=1, linecolor='white')

plt.xlabel('Predicted Values', fontsize=14)
plt.ylabel('True Values', fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.savefig('./figures/confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

3.8.3 — Prediction validation results

y_pred = model.predict(X_test)
X_test__ = X_test.reshape(X_test.shape[0], 8, 8)  # drop the channel dimension for plotting

fig, axis = plt.subplots(4, 4, figsize=(12, 14))
for i, ax in enumerate(axis.flat):
    ax.imshow(X_test__[i], cmap='binary')
    ax.set(title=f"Real Number is {y_test[i]}\nPredict Number is {y_pred[i].argmax()}")

3.9 — Obtaining the model to be implemented in the microcontroller

3.9.1 — Convert some hex value into an array for C programming

# Function: Convert some hex value into an array for C programming
def hex_to_c_array(hex_data, var_name):

    c_str = ''

    # Create header guard
    c_str += '#ifdef __has_attribute\n'
    c_str += '#define HAVE_ATTRIBUTE(x) __has_attribute(x)\n'
    c_str += '#else\n'
    c_str += '#define HAVE_ATTRIBUTE(x) 0\n'
    c_str += '#endif\n'
    c_str += '#if HAVE_ATTRIBUTE(aligned) || (defined(__GNUC__) && !defined(__clang__))\n'
    c_str += '#define DATA_ALIGN_ATTRIBUTE __attribute__((aligned(4)))\n'
    c_str += '#else\n'
    c_str += '#define DATA_ALIGN_ATTRIBUTE\n'
    c_str += '#endif\n\n'

    # Declare C variable
    c_str += 'const unsigned char ' + var_name + '[] DATA_ALIGN_ATTRIBUTE = {'
    hex_array = []
    for i, val in enumerate(hex_data):

        # Construct string from hex
        hex_str = format(val, '#04x')

        # Add formatting so each line stays within 80 characters
        if (i + 1) < len(hex_data):
            hex_str += ','
        if (i + 1) % 12 == 0:
            hex_str += '\n '
        hex_array.append(hex_str)

    # Add closing brace
    c_str += '\n ' + format(' '.join(hex_array)) + '\n};\n\n'

    # Close out header guard
    c_str += 'const int ' + var_name + '_len = ' + str(len(hex_data)) + ';\n'

    return c_str

3.9.2 — Convert the model to Float32 and Int8

def representative_dataset():
    for i in range(len(X_train)):
        input_data = np.array([X_train[i]], dtype=np.float32)
        yield [input_data]

def converter_quantization_model(model, model_name):
    # use only the file name (without the path) for the generated C variable name
    var_name = os.path.basename(model_name)

    # Convert the model to float32
    converter_float32 = tf.lite.TFLiteConverter.from_keras_model(model)
    converter_float32.optimizations = [tf.lite.Optimize.DEFAULT]
    converter_float32.target_spec.supported_types = [tf.float32]
    converter_float32._experimental_lower_tensor_list_ops = False
    converter_float32.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                                   tf.lite.OpsSet.SELECT_TF_OPS]
    converter_float32.representative_dataset = representative_dataset
    tflite_model_float32 = converter_float32.convert()

    # Save the float32 model as a C header and as a .tflite file
    with open(model_name + '_quant_float32.h', 'w') as file:
        file.write(hex_to_c_array(tflite_model_float32, var_name + '_quant_float32'))
    with open(model_name + '_quant_float32.tflite', 'wb') as f:
        f.write(tflite_model_float32)
    size_model_tflite_float32 = os.path.getsize(model_name + '_quant_float32.tflite')
    print(model_name + f'_quant_float32.tflite: {size_model_tflite_float32} Bytes')

    # Convert the model to int8
    converter_int8 = tf.lite.TFLiteConverter.from_keras_model(model)
    converter_int8.optimizations = [tf.lite.Optimize.DEFAULT]
    converter_int8.target_spec.supported_types = [tf.int8]
    converter_int8.representative_dataset = representative_dataset
    converter_int8.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter_int8.experimental_new_converter = True
    converter_int8.experimental_new_quantizer = True
    converter_int8.experimental_new_calibrator = True
    tflite_model_int8 = converter_int8.convert()

    # Save the int8 model as a C header and as a .tflite file
    with open(model_name + '_quant_int8.h', 'w') as file:
        file.write(hex_to_c_array(tflite_model_int8, var_name + '_quant_int8'))
    with open(model_name + '_quant_int8.tflite', 'wb') as f:
        f.write(tflite_model_int8)
    size_model_tflite_int8 = os.path.getsize(model_name + '_quant_int8.tflite')
    print(model_name + f'_quant_int8.tflite: {size_model_tflite_int8} Bytes')

    return None

model_name = './models/model'
converter_quantization_model(model, model_name)

3.10 — Quantized Model Evaluation

def evaluate_quantization(model_path, X_test, y_test, quantization_type):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    # Evaluate the quantized model
    input_index = interpreter.get_input_details()[0]['index']
    output_index = interpreter.get_output_details()[0]['index']
    predictions = []
    processing_times = []

    X_test = np.array(X_test, dtype=np.float32)

    for X in X_test:
        interpreter.set_tensor(input_index, [X])
        start_time = time.time()
        interpreter.invoke()
        end_time = time.time()
        processing_times.append(end_time - start_time)
        output = interpreter.get_tensor(output_index).argmax(axis=1)
        predictions.append(output[0])

    acc = accuracy_score(y_test, predictions)

    # Report accuracy and the average inference time per sample
    result = {"Accuracy (%)": acc * 100,
              "Process time (s)": np.mean(processing_times)}

    return result

model_name = './models/model'
eval_quant_float32 = evaluate_quantization(model_name + '_quant_float32.tflite', X_test, y_test, 'float32')
eval_quant_float32
eval_quant_int8 = evaluate_quantization(model_name + '_quant_int8.tflite', X_test, y_test, 'int8')
eval_quant_int8

3.11 — Deploy Model

With this example, you can deploy the model on an ESP32, Arduino, Arduino Portenta H7 with Vision Shield, Raspberry Pi, and other microcontrollers or IoT devices.

3.11.1 — Install the EloquentTinyML Library

Go to the Arduino libraries folder and install the EloquentTinyML-main library.

3.11.2 — Add the model to the sketch

Open the generated model_quant_float32.h or model_quant_int8.h (from step 3.9.2), copy the full hex array (the const unsigned char ... [] = { ... }; block) together with the model length constant (const int ..._len), and paste them into a file named model.h inside the Arduino sketch folder. The sketch below includes this file with #include "model.h" and loads the array with tf.begin(model), so rename the copied array to model (and its length to model_len) or adjust the sketch to use the generated names.

3.11.3 — Complete Arduino Sketch

#include <EloquentTinyML.h>
#include <eloquent_tinyml/tensorflow.h>

// model.h contains the array you exported from Python with hex_to_c_array (step 3.9)
#include "model.h"

#define N_INPUTS 64
#define N_OUTPUTS 10
// in future projects you may need to tweak this value: it's a trial and error process
#define TENSOR_ARENA_SIZE 6*1024

Eloquent::TinyML::TensorFlow::TensorFlow<N_INPUTS, N_OUTPUTS, TENSOR_ARENA_SIZE> tf;

float input[64] = {0.00000000000f, 0.12500000000f, 0.00000000000f, 0.50000000000f, 0.56250000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.81250000000f, 0.31250000000f, 0.87500000000f, 0.50000000000f, 0.43750000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.75000000000f, 0.31250000000f, 0.12500000000f, 0.00000000000f, 0.56250000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.43750000000f, 0.31250000000f, 0.00000000000f, 0.00000000000f, 0.18750000000f, 0.31250000000f, 0.00000000000f, 0.00000000000f, 0.18750000000f, 0.62500000000f, 0.00000000000f, 0.00000000000f, 0.12500000000f, 0.62500000000f, 0.00000000000f, 0.00000000000f, 0.06250000000f, 0.81250000000f, 0.00000000000f, 0.00000000000f, 0.06250000000f, 0.75000000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.31250000000f, 0.81250000000f, 0.31250000000f, 0.56250000000f, 0.81250000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.00000000000f, 0.56250000000f, 1.00000000000f, 1.00000000000f, 0.43750000000f, 0.00000000000f};

float y_pred[10] = {0};

void setup() {
    Serial.begin(9600);
    delay(4000);
    tf.begin(model);

    // check if model loaded fine
    if (!tf.isOk()) {
        Serial.print("ERROR: ");
        Serial.println(tf.getErrorMessage());

        while (true) delay(1000);
    }
}

void loop() {
    tf.predict(input, y_pred);

    for (int i = 0; i < 10; i++) {
        Serial.print(y_pred[i]);
        Serial.print(i == 9 ? '\n' : ',');
    }

    Serial.print("Predicted class is: ");
    Serial.println(tf.probaToClass(y_pred));
    // or you can skip the predict() method and call predictClass() directly
    Serial.print("Sanity check: ");
    Serial.println(tf.predictClass(input));

    delay(2000);
}

3.12 — Results

3.12.1 — Quantized Model Float32

3.12.2 — Quantized Model Int8

Full project in: TinyML/13_CNN at main · thommaskevin/TinyML (github.com)

If you like it, consider buying my coffee ☕️💰 (Bitcoin)

code: bc1qzydjy4m9yhmjjrkgtrzhsgmkq79qenvcvc7qzn


