3 Practical Tips to Combat Data Scarcity in Music AI | by Max Hilsdorf | May, 2024

Banner generated with DALL-E 2 by the author.

Bigger = Better?

In AI, bigger is often better, provided there is enough data to feed these large models. With limited data, however, bigger models are more prone to overfitting, which occurs when the model memorizes patterns from the training data that do not generalize to real-world examples. But there is another way to look at this that I find even more compelling in this context.

Suppose you have a small dataset of spectrograms and are deciding between a small CNN (100k parameters) and a large CNN (10 million parameters). Remember that every model parameter is effectively a best-guess number derived from the training dataset. Seen this way, it is clear that, given the same limited data, it is easier for a model to get 100k parameters right than to nail 10 million.
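To make the parameter argument concrete, here is a rough sketch that counts the trainable parameters of two hypothetical CNNs and relates them to the dataset size. The layer widths, kernel size, number of classes, and the 1,000-example dataset are all illustrative assumptions, not architectures from this article.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights of a square 2D convolution plus one bias per output channel."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights of a fully connected layer plus one bias per output unit."""
    return (in_features + 1) * out_features


def cnn_params(channels, kernel_size, n_classes):
    """Count parameters of a stack of conv layers followed by global pooling
    and a single dense classification layer.

    channels: [input_channels, width_of_layer_1, width_of_layer_2, ...]
    """
    conv_total = sum(
        conv2d_params(kernel_size, c_in, c_out)
        for c_in, c_out in zip(channels, channels[1:])
    )
    return conv_total + dense_params(channels[-1], n_classes)


# Hypothetical architectures (assumptions for illustration only).
small = cnn_params([1, 16, 32, 64], kernel_size=3, n_classes=10)
large = cnn_params([1, 64, 128, 256, 512], kernel_size=3, n_classes=10)

n_examples = 1000  # assumed size of a small spectrogram dataset
for name, n_params in [("small", small), ("large", large)]:
    print(f"{name}: {n_params} parameters, "
          f"{n_examples / n_params:.5f} training examples per parameter")
```

With these assumed layer widths, the small network has roughly 24k parameters and the large one roughly 1.55 million, so the same dataset has to pin down about 65 times as many numbers in the larger model.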

In the end, both arguments lead to the same conclusion:

If data is scarce, consider building smaller models that focus only on the essential patterns.

But how can we achieve smaller models in practice?

Don’t Crack Walnuts with a Sledgehammer

My learning journey in Music AI has been dominated by deep learning. Up until a year ago, I had solved almost every problem using large neural networks. While this makes sense for complex tasks like music tagging or instrument recognition, not every task is that complicated.

For instance, a decent BPM estimator or key detector can be built without any machine learning by analyzing the time between onsets or by correlating chromagrams with key profiles, respectively.
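Both ideas can be sketched in a few lines of plain Python. This is a minimal sketch under strong assumptions, not a production estimator: it assumes the onset times and the 12-bin chroma vector have already been extracted by some audio library, that there is roughly one onset per beat, and it uses the Krumhansl-Kessler key profiles as one common choice of key templates. The function names are hypothetical.

```python
from statistics import median

# Krumhansl-Kessler key profiles; index 0 is the tonic of the key.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def estimate_bpm(onset_times):
    """Estimate tempo from the median time between consecutive onsets (seconds).

    Assumes roughly one onset per beat, which real music often violates.
    """
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    return 60.0 / median(intervals)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def estimate_key(chroma):
    """Return the best-matching key for a 12-bin chroma vector (index 0 = C)."""
    best_score, best_key = None, None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so that its tonic sits on pitch class `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            score = pearson(chroma, rotated)
            if best_score is None or score > best_score:
                best_score, best_key = score, f"{NOTES[tonic]} {mode}"
    return best_key


# Toy inputs: onsets every half second, and a chroma vector shaped like C# major.
print(estimate_bpm([0.0, 0.5, 1.0, 1.5, 2.0]))
print(estimate_key([MAJOR[(pc - 1) % 12] for pc in range(12)]))
```

Neither function has any trainable parameters, which is exactly what makes this kind of approach attractive when labelled data is scarce.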

Even for tasks like music tagging, it doesn’t always have to be a deep learning model. I’ve achieved good results in mood tagging with a simple k-nearest neighbors classifier over a pretrained embedding space (e.g. CLAP).
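A nearest-neighbor tagger of this kind can be sketched as follows. This is only an illustration: the embeddings are assumed to be precomputed by a pretrained model, the 3-dimensional vectors and mood labels are made-up toy data (real embeddings have hundreds of dimensions), and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, embeddings, labels, k=3):
    """Predict a label by majority vote among the k most similar embeddings."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for precomputed track embeddings (assumed, not real data).
train_embeddings = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.3, 0.0],
    [0.1, 0.9, 0.2], [0.0, 0.8, 0.3], [0.2, 0.7, 0.1],
]
train_moods = ["happy", "happy", "happy", "sad", "sad", "sad"]

prediction = knn_predict([0.85, 0.15, 0.05], train_embeddings, train_moods, k=3)
print(prediction)
```

The only thing "trained" here is the stored list of labelled embeddings, so even a few examples per mood can be enough to get started.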

While most state-of-the-art methods in Music AI are based on deep learning, alternative solutions should be considered under data scarcity.

Pay Attention to the Data Input Size

More important than the choice of models is usually the choice of input data. In Music AI, we rarely use raw waveforms as input due to their data inefficiency. By transforming waveforms into (mel)spectrograms, we can decrease the input data dimensionality by a factor of 100 or more. This matters because large data inputs typically require larger and/or more complex models to process them.

To minimize the size of the model input, we can take two routes:

  1. Using smaller music snippets.
  2. Using more compressed/simplified music representations.

Using Smaller Music Snippets

Using smaller music snippets is especially effective if the outcome we are interested in is global, i.e. applies to every section of the song. For example, we can assume that the genre of a track remains relatively stable over the course of the track. Because of that, we can easily use 10-second snippets instead of full tracks (or the very common 30-second snippets) for a genre classification task.

This has two advantages:

  1. Shorter snippets result in fewer data points per training example, allowing you to use smaller models.
  2. By drawing three 10-second snippets instead of one 30-second snippet, we can triple the number of training observations.

All in all, this means that we can build less data-hungry models and, at the same time, feed them more training examples than before.
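The snippet idea can be sketched as a small helper that cuts one labelled track into non-overlapping pieces, each inheriting the track-level label. The sample rate, the placeholder audio, the "rock" label, and the function name are assumptions for illustration.

```python
def split_into_snippets(samples, sample_rate, snippet_seconds, label):
    """Cut a labelled track into non-overlapping snippets of equal length.

    Every snippet inherits the track-level label; an incomplete tail is dropped.
    """
    snippet_len = int(sample_rate * snippet_seconds)
    n_snippets = len(samples) // snippet_len
    return [
        (samples[i * snippet_len:(i + 1) * snippet_len], label)
        for i in range(n_snippets)
    ]


sample_rate = 22050                  # assumed sample rate in Hz
track = [0.0] * (30 * sample_rate)   # placeholder for 30 seconds of audio

snippets = split_into_snippets(track, sample_rate, snippet_seconds=10, label="rock")
print(len(snippets), len(snippets[0][0]))
```

One design note: snippets cut from the same track are highly correlated, so it is usually safer to keep all snippets of a track in the same train or test split.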

However, there are two potential dangers here. Firstly, the snippets must be long enough for a classification to be possible. For example, even humans struggle with genre classification when presented with 3-second snippets. We should choose the snippet size carefully and view this decision as a hyperparameter of our AI solution.

Secondly, not every musical attribute is global. For example, if a song features vocals, this doesn’t mean that there are no instrumental sections. If we cut such a track into very short snippets and label every snippet as containing vocals, we introduce many falsely labelled examples into our training dataset.

Using More Efficient Music Representations

If you studied Music AI ten years ago (back when all of this was called “Music Information Retrieval”), you learned about chromagrams, MFCCs, and beat histograms. These handcrafted features were designed to make music data work with traditional ML approaches. With the rise of deep learning, it might seem like these features have been entirely replaced by (mel)spectrograms.

Spectrograms compress music into images without much information loss, making them ideal in combination with computer vision models. Instead of engineering custom features for different tasks, we can now use the same input data representation and model for most Music AI problems — provided you have tens of thousands of training examples to feed these models with.

When data is scarce, we want to compress the information as much as possible to make it easier for the model to extract relevant patterns from the data. Consider the four music representations below and ask yourself which one helps you identify the musical key the fastest.

Examples of four different representations of the same song (“Honky Tonk Woman” by Tina Turner). Although the chromagram is roughly 700k times smaller than the waveform, it lets us identify the key (C# major) much more effectively. Image created by the author.

While mel spectrograms can be used as an input for key detection systems (and possibly should be if you have enough data), a simple chromagram averaged along the time dimension reveals this specific information much more quickly. That is why spectrograms require complex models like CNNs, while a chromagram can be easily analyzed by traditional models like logistic regression or decision trees.
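Averaging a chromagram along the time dimension is a one-liner in spirit. The sketch below assumes the chromagram has already been computed by some audio library and is given as 12 rows (pitch classes C to B) of per-frame energies. The toy values are made up and only mimic a track dominated by the notes of a C# major triad.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def average_chroma(chromagram):
    """Collapse a 12 x n_frames chromagram into 12 normalized mean energies."""
    means = [sum(row) / len(row) for row in chromagram]
    total = sum(means)
    return [m / total for m in means] if total > 0 else means


# Toy chromagram with 4 frames (assumed values, not taken from a real song).
toy_chromagram = [[0.0] * 4 for _ in range(12)]
toy_chromagram[1] = [0.8, 0.9, 1.0, 0.9]   # C#
toy_chromagram[5] = [0.4, 0.5, 0.4, 0.3]   # F (E# in C# major)
toy_chromagram[8] = [0.6, 0.5, 0.6, 0.7]   # G#

features = average_chroma(toy_chromagram)
strongest = sorted(range(12), key=lambda pc: features[pc], reverse=True)[:3]
print([NOTES[pc] for pc in strongest])
```

The resulting 12 numbers per track are the kind of compact input that a logistic regression or a decision tree can handle with very few training examples.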

In summary, the established spectrogram + CNN combination remains highly effective for many problems, provided you have enough data. However, with smaller datasets, it might make sense to revisit some feature engineering techniques from MIR or develop your own task-specific representations.
