Over the last few years, machine learning has seen two seemingly opposing trends. On the one hand, models keep getting bigger, culminating in what's all the rage these days: large language models. Nvidia's Megatron-Turing Natural Language Generation model has 530 billion parameters! On the other hand, these models are being deployed onto ever smaller devices, such as smartwatches or drones, whose memory and computing power are naturally limited by their size.
How do we squeeze ever larger models into ever smaller devices? The answer is model optimization: the process of reducing a model's size and latency. In this article, we will see how it works and how to implement two popular model optimization methods, quantization and pruning, in TensorFlow.
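To give a flavor of what this looks like in code, below is a minimal sketch of post-training quantization using the TensorFlow Lite converter. It assumes an already trained Keras model named model and is only a preview, not necessarily the exact workflow we will build up later in the article.

import tensorflow as tf

# A minimal sketch of post-training quantization with the TensorFlow Lite
# converter; `model` stands for an already trained Keras model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# The default optimizations include quantizing the weights to 8-bit integers,
# which typically shrinks the model to roughly a quarter of its size.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# convert() returns a serialized TFLite model that can be written to disk
# and deployed on a mobile or embedded device.
tflite_quantized_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_quantized_model)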
Before we jump to model optimization techniques, we need a toy model to be optimized. Let’s train a simple binary classifier to differentiate between Paris’ two famous landmarks: the Eiffel Tower and the Mona Lisa, as drawn by the players of Google’s game called “Quick, Draw!”. The QuickDraw dataset consists of 28×28 grayscale images.
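One way to load the data is sketched below. It assumes the two QuickDraw numpy bitmap files have already been downloaded to disk; the file names are placeholders, and the 80/20 train/test split is an arbitrary choice.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical local file names; the QuickDraw numpy bitmaps come as one
# .npy file per category, each row being a flattened 28x28 grayscale drawing.
eiffel_tower = np.load("eiffel_tower.npy")
mona_lisa = np.load("mona_lisa.npy")

# Stack the two categories, scale pixels to [0, 1], and restore the 28x28 shape.
X = np.concatenate([eiffel_tower, mona_lisa]).astype("float32") / 255.0
X = X.reshape(-1, 28, 28)

# Label Eiffel Tower drawings as 0 and Mona Lisa drawings as 1.
y = np.concatenate([np.zeros(len(eiffel_tower)), np.ones(len(mona_lisa))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)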
Let’s train a simple convolutional network to classify the two landmarks.
import tensorflow as tf

def get_model():
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(
            filters=12, kernel_size=(3, 3), activation="relu"
        ),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Conv2D(
            filters=24, kernel_size=(3, 3), activation="relu"
        ),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

model_baseline = get_model()
model_baseline.compile(
    optimizer="adam",
    # Binary cross-entropy and accuracy are the standard choices for a
    # single sigmoid output; the loss and metrics are assumed here.
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
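With the model compiled, training is a standard Keras fit call. The sketch below assumes the X_train and y_train arrays from the data-loading snippet above; the epoch count and validation split are placeholders rather than tuned values.

# Train the baseline classifier on the QuickDraw drawings.
model_baseline.fit(
    X_train,
    y_train,
    epochs=5,
    validation_split=0.1,
)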