
The problem I try to solve here is how to load audio files and preprocess them dynamically (in a "lazy" manner) to reduce the resources needed to train a model. We will load audio files, split each of them into equal parts, create spectrograms and feed them to the model using TensorFlow.
The full notebook is available here!
We are working with audio files. All the files are organized in folders named after their category, which is how audio datasets are usually arranged (screenshot below). In my example I use the BirdCLEF 2023 dataset from Kaggle. I have only considered 11 of the bird species; the names of the folders with the audio recordings represent those species.
We want to create a model that learns their songs and predicts the bird once the test data is presented.
We need to split each file into 5-second parts and get a spectrogram for each part. Those spectrograms then need to be shuffled and fed into the model for training, validation and testing.
We want to use as few resources as possible so that we can use the free Google Colab service. This can be achieved with a "lazy loader" that loads a particular array only when it is accessed.
The challenge is that our audio files have different durations: some might be 3 seconds, others might be 3 minutes (180 seconds), and we need all our spectrograms to cover exactly 5 seconds.
How can we create a lazy loader that does all of that pre-processing, shuffling, etc.?
Below is a display of the first file/recording in our dataset; we can see it is around 31 seconds long. We want to split this recording into 5-second parts that overlap by 2.5 seconds, and then get the spectrograms of those parts.
Here are the first 2 parts generated by the splitting:
We can confirm the overlap by noticing the part in the orange rectangle.
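To see in isolation how this kind of overlapping split works, here is a tiny sketch of my own (not from the notebook) using the same tf.signal.frame call we will rely on below; the numbers are deliberately small so the overlap is visible:

import tensorflow as tf

# a toy "signal" of 10 samples: 0..9
signal = tf.range(10)

# frame_length=4 with frame_step=2 gives 50% overlap; pad_end=True fills
# the last frame with zeros, just like the 5 sec / 2.5 sec split below
frames = tf.signal.frame(signal, frame_length=4, frame_step=2, pad_end=True)
print(frames.numpy())
# [[0 1 2 3]
#  [2 3 4 5]
#  [4 5 6 7]
#  [6 7 8 9]
#  [8 9 0 0]]   <- last frame zero-padded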
I will create a class that opens the files and creates the spectrograms. Before getting to that step we have to make sure we have a file/data frame that contains the labels of the recordings and the file paths to those recordings (as shown below).
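If you need to build such a data frame yourself, a minimal sketch could look like the following; the base_dir value and the column names primary_label, fullpath and label_id are my assumptions, the notebook may name them differently:

import os
import pandas as pd

base_dir = 'train_audio'  # hypothetical root folder: one sub-folder per species
rows = []
for species in sorted(os.listdir(base_dir)):
    species_dir = os.path.join(base_dir, species)
    for fname in sorted(os.listdir(species_dir)):
        rows.append({'primary_label': species,
                     'fullpath': os.path.join(species_dir, fname)})
df = pd.DataFrame(rows)
# integer ids, needed later for SparseCategoricalCrossentropy
df['label_id'] = df['primary_label'].astype('category').cat.codes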
Steps:
1/ load the audio file
2/ split the audio file into 5 sec parts
3/ create spectrogram for each part
As we want to do that for all the files in our dataset in a "lazy" manner, we will use the "map" and "flat_map" functions.
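In case the difference between the two is unfamiliar: map transforms each element one-to-one, while flat_map maps each element to a whole sub-dataset and flattens the results into a single stream. A toy illustration (my own, not from the notebook):

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# map: one element in, one element out
doubled = ds.map(lambda x: x * 2)  # yields 2, 4, 6

# flat_map: one element in, a small dataset out, flattened into the stream
repeated = ds.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(3))  # 1,1,1,2,2,2,3,3,3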
from google.colab import drive
import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio
import tensorflow_probability as tfp
from IPython.display import Audio
import matplotlib.pyplot as plt
import math
from os import path
import os
import pathlib
from tensorflow.keras import layers
from tensorflow.keras import models
import datetime

tf.random.set_seed(42)
class Generate_Spec_Data:
    # 1/ load the audio file
    def load_wav(self, filename):
        '''
        filename -> this is the full path to the file we want to load
        We use tfio.audio.decode_vorbis as the file has the "ogg" extension.
        Then we drop the trailing channel dimension (size 1)
        by using tf.squeeze(wav, axis=-1)
        '''
        file_contents = tf.io.read_file(filename)
        wav = tfio.audio.decode_vorbis(file_contents)
        wav = tf.squeeze(wav, axis=-1)
        return wav
    # this will be used with the map function and will provide the decoded
    # audio file with the corresponding label
    def load_wav_for_map(self, fullpath, primary_label):
        wav = self.load_wav(fullpath)
        return wav, primary_label
    ##################################################################
    # 2/ split the audio file into 5 sec parts
    # we have already identified the sampling rate of the files
    # to be sample_rate = 32000. More details can be found in the
    # provided GitHub repository.
    def split_wav(self, wav, width, stride):
        '''
        this splits the provided audio file into parts of frame_length=width
        with a hop of frame_step=stride, padding at the end (meaning that we
        fill the end of the last part with 0s to make it equal to frame_length).
        In our case we have:
        frame_length = seconds*sample_rate = 5*32000 = 160000 for 5 second parts
        frame_step = seconds*sample_rate = 2.5*32000 = 80000 for an overlap of 2.5 seconds
        '''
        return tf.signal.frame(wav, frame_length=width, frame_step=stride, pad_end=True)
    def split_wav_for_flat_map(self, wav, primary_label):
        '''
        from 1 audio file we generate multiple 5 second parts, and we want
        each of them to have the corresponding label. We will also add a flag
        named purpose_in_life that will later allow us to shuffle and split
        all those parts into training/validation/test sets
        '''
        wavs = self.split_wav(wav, width=160000, stride=80000)
        # here we assign the correct label to each new 5 sec part
        labels = tf.repeat(primary_label, tf.shape(wavs)[0])
        # these are the training/validation/test proportions
        probabilities = [0.7, 0.15, 0.15]
        categorical_dist = tfp.distributions.Categorical(probs=probabilities)
        purpose_in_life = categorical_dist.sample(tf.shape(wavs)[0])
        return tf.data.Dataset.from_tensor_slices((wavs, labels, purpose_in_life))
    ##########################################################################
    # 3/ create spectrogram
    def create_spectrogram(self, samples):
        '''
        we create a MEL spectrogram for each 5 second part
        (https://en.wikipedia.org/wiki/Mel_scale)
        this is my personal choice, I just think it brings better results
        '''
        i = tf.abs(tf.signal.stft(samples, frame_length=512, frame_step=256))
        mel_spectrogram = tfio.audio.melscale(i, rate=32000, mels=128,
                                              fmin=0, fmax=16000)
        return tfio.audio.dbscale(mel_spectrogram, top_db=50)
    def create_spectrogram_for_map(self, samples, label, purpose_in_life):
        '''
        This function returns the 5 sec spectrogram together with its label and
        the number that assigns this spectrogram to the train/validation/test set
        '''
        return self.create_spectrogram(samples), label, purpose_in_life
    #########################################################################
    def launch(self, feature1, feature2):
        '''
        This takes the file path and label for each file we have in our dataset
        and executes steps 1, 2 and 3:
        1/ load the audio file
        2/ split the audio file into 5 sec parts
        3/ create spectrogram for each part
        here we use the "lazy" approach again by using tf.data.Dataset.from_tensor_slices
        '''
        self.dataset = tf.data.Dataset.from_tensor_slices((feature1, feature2))
        self.x = self.dataset.map(self.load_wav_for_map)
        self.x = self.x.flat_map(self.split_wav_for_flat_map)
        self.x = self.x.map(self.create_spectrogram_for_map)
        return self.x
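To produce the dataset used in the next cell we instantiate the class and call launch; this is a minimal sketch assuming the data frame columns from my earlier example (the notebook may use different names):

gen = Generate_Spec_Data()
# feature1 = file paths, feature2 = integer labels (see load_wav_for_map)
spectrograms_ds = gen.launch(df['fullpath'].values, df['label_id'].values)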
used_ds = spectrograms_ds
train_ds = used_ds.filter(lambda spectrogram, label,
                          purpose_in_life: purpose_in_life == 0)
#training set corresponding to 70% of the data
val_ds = used_ds.filter(lambda spectrogram, label,
purpose_in_life: purpose_in_life == 1)
#validation set corresponding to 15% of the data
test_ds = used_ds.filter(lambda spectrogram, label,
purpose_in_life: purpose_in_life == 2)
#test set corresponding to 15% of the data
######################################
remove_fold_column = lambda spectrogram, label, fold:\
(tf.expand_dims(spectrogram, axis=-1), label)
#this function takes 3 columns as input and returns only 2
# (dropping the column with "purpose_in_life")
######################################
train_ds = train_ds.map(remove_fold_column)
val_ds = val_ds.map(remove_fold_column)
test_ds = test_ds.map(remove_fold_column)
train_ds = train_ds.cache().shuffle(100000, seed=42).\
    batch(64).prefetch(tf.data.AUTOTUNE)
# we need a big shuffle buffer: the dataset is ordered by label, and since
# each file is split into many 5 sec parts there are a lot of entries with
# the first label at the beginning; we need representation from all the
# labels in the batches
val_ds = val_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)
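Before building the model it is worth taking one batch to confirm the shapes the pipeline produces. With 5 sec parts of 160000 samples and an STFT with frame_length=512 and frame_step=256 we get 1 + (160000 - 512) // 256 = 624 time frames and 128 mel bins, which is exactly the input shape the model below declares. A quick check (mine, not from the notebook):

for spectrograms, labels in train_ds.take(1):
    print(spectrograms.shape)  # (64, 624, 128, 1): batch, time frames, mel bins, channel
    print(labels.shape)        # (64,)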
After we have set up our data loaders, it is time to build a model and train it.
I have used a very basic one with convolutions to process the spectrogram images. I also added a normalization layer that is adapted on the training data.
norm_layer = tf.keras.layers.experimental.preprocessing.Normalization()
norm_layer.adapt(train_ds.map(lambda x, y: x))
######################################################
model = models.Sequential([
    layers.Input(shape=(624, 128, 1)),
    # Downsample the input.
    layers.Resizing(32, 32),
    # Normalize.
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(11),
])
########################################################
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
#########################################################
# set a learning rate scheduler so that we can decrease the learning rate
# as we progress through the training epochs
def lr_scheduler(epoch, lr):
    if epoch < 20:
        return lr
    else:
        return lr * 0.9
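A quick sanity check of the schedule (my own illustration): the learning rate stays at 0.001 for the first 20 epochs and is then multiplied by 0.9 every epoch:

lr = 0.001
for epoch in range(25):
    lr = lr_scheduler(epoch, lr)
print(round(lr, 6))  # ~0.00059 after 25 epochs, i.e. 0.001 * 0.9**5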
#########################################################
# we would like to use tensorboard
%load_ext tensorboard
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lr_scheduler),
    tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)]
#########################################################
After everything is set up, we can start training the model.
The focus of this article is how to create the data loaders, so I skipped rather quickly through the model creation.
model.fit(
    train_ds,
    epochs=50,
    callbacks=callbacks,
    validation_data=val_ds
)
You will notice that during training the total number of batches is "unknown", as it is all happening dynamically.
Yet after it is complete, all looks well 🙂
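This is expected for filter-based pipelines: tf.data cannot know in advance how many elements will pass each filter. You can confirm it by querying the dataset's cardinality (a quick check, not from the notebook):

print(train_ds.cardinality())  # -2, i.e. tf.data.UNKNOWN_CARDINALITY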
# we can now check the results in Tensorboard
%tensorboard --logdir logs/fit
This approach saves us a lot of resources: this example used about 7.2 GB of CPU RAM to preprocess the data and train the model. We also significantly reduced the disk space used, as we only keep the original audio files on the hard drive.
You can find the notebook and play with the full code here. There are also some more details and explanations.
I want to thank "sandeepmistry" for his openly available publication, which helped a lot here. All other references can be found in my full notebook.