Why TensorFlow metrics should not be trusted | by Vuk Ignjatovic | Jan, 2024

Today I finally fixed the bug that has haunted me for the last two months, and my model is now complete. After countless hours of debugging with scikit-learn and trying methods like balancing class weights, I have concluded that TensorFlow’s evaluate() function in v2.12 is not working correctly on Ubuntu systems.

When I asked TensorFlow to evaluate my model, it gave me the following metrics:

loss: 0.3311 - 
binary_accuracy: 0.8877 -
precision: 0.0000e+00 -
recall: 0.0000e+00 -
auc: 0.7139 -
val_loss: 0.3415 -
val_binary_accuracy: 0.8979 -
val_precision: 0.0000e+00 -
val_recall: 0.0000e+00 -
val_auc: 0.7410 - 445s/epoch - 2s/step

But when I asked scikit-learn for the confusion matrix, it gave me the image above, which explains why my model cannot predict correctly.
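To make the confusion-matrix reading concrete, here is a minimal sketch in plain Python. The labels are made up (89 benign and 11 malignant, chosen only to resemble a heavily imbalanced split, not taken from the real data), and the `confusion_counts` helper is hypothetical. It shows how a classifier that never predicts the positive class can still score high accuracy while precision and recall are 0.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp


# Made-up labels: 89 benign (0) and 11 malignant (1).
y_true = [0] * 89 + [1] * 11
# A model that always predicts "benign".
y_pred = [0] * 100

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# With no positive predictions the precision is 0/0; treat it as 0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With these toy labels every malignant sample ends up as a false negative, so accuracy is simply the share of benign samples.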

For some context, I am building a skin cancer detection app and used the well-known HAM10000 dataset, which is hosted on the Harvard Dataverse. The dataset is extremely imbalanced: benign lesions far outnumber malignant ones.

This means that my model’s precision and recall, which are the most important metrics for a biotech AI model, are close or equal to 0. After validating and testing the model, I got the following metrics (in the order loss, binary accuracy, precision, recall, AUC):

Validation Results: [0.31469643115997314, 0.8958333134651184, 0.0, 0.0, 0.7172790765762329]
Test Results: [0.3305717408657074, 0.890625, 0.0, 0.0, 0.6953222751617432]

But the picture changed when I ran the following commands:

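from sklearn.metrics import precision_score, recall_score

# evaluate() returns the metrics in the same order as the log above,
# so index 1 is binary_accuracy
evaluation_results = loaded_model.evaluate(test_dataset, steps=test_size // batch_size, verbose=0)
# average='weighted' averages the per-class scores, weighted by class support
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
accuracy = evaluation_results[1]
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")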
Accuracy: 0.8896484375
Precision: 0.7933743999940139
Recall: 0.8907156673114119
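One possible reason the two libraries disagree here is that they are not measuring the same thing. The Keras Precision and Recall metrics score the positive class only, while `average='weighted'` in scikit-learn scores every class and then takes a support-weighted mean. The sketch below reimplements that idea by hand in plain Python, so it is an approximation of the definition rather than scikit-learn's own code. The labels are made up and the helper names are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and support when `label` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0  # 0/0 is treated as 0
    recall = tp / actual if actual else 0.0
    return precision, recall, actual


def weighted_scores(y_true, y_pred):
    """Support-weighted mean of the per-class precision and recall."""
    n = len(y_true)
    w_precision = 0.0
    w_recall = 0.0
    for label in sorted(set(y_true)):
        p, r, support = per_class_scores(y_true, y_pred, label)
        w_precision += p * support / n
        w_recall += r * support / n
    return w_precision, w_recall


# Made-up labels: 89 benign (0), 11 malignant (1), and a model that
# predicts "benign" for every sample.
y_true = [0] * 89 + [1] * 11
y_pred = [0] * 100

pos_precision, pos_recall, _ = per_class_scores(y_true, y_pred, label=1)
w_precision, w_recall = weighted_scores(y_true, y_pred)

print(f"positive class: precision={pos_precision:.4f} recall={pos_recall:.4f}")
print(f"weighted:       precision={w_precision:.4f} recall={w_recall:.4f}")
```

For an all-negative predictor the weighted recall equals the accuracy and the weighted precision equals the accuracy squared. The values printed above follow the same pattern (0.8907 squared is about 0.7934), which is consistent with a model that predicts almost nothing as malignant, although it does not prove it.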

When I tested the model manually with some samples, I got fine results; I just needed to re-adjust the threshold. Because I had only trained for 10 epochs, I thought I could increase the precision and recall by training longer. After a little more than 8 epochs, however, training was stopped by the EarlyStopping callback I used to prevent overfitting.
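Re-adjusting the threshold can be sketched as follows. The probabilities and labels are made up for illustration, and `precision_recall_at` is a hypothetical helper, not part of TensorFlow or scikit-learn. The point is that when a model rarely outputs a probability above 0.5 for the rare class (0.5 being, as far as I know, the default threshold of the Keras Precision and Recall metrics), lowering the cut-off trades precision for recall.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall of the positive class at a given probability cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth and predicted probabilities for ten samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.30, 0.45, 0.40, 0.60]

# Evaluate the same predictions at three different cut-offs.
results = {t: precision_recall_at(y_true, y_prob, t) for t in (0.5, 0.4, 0.3)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, dropping the threshold from 0.5 to 0.3 raises recall from one in three to all three malignant samples, at the cost of two false positives.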

After training stopped, my model had higher performance metrics according to the evaluate function, but if you tested it manually you would see that it was a mess. For this model, scikit-learn gave the following values:

Accuracy: 0.693414
Precision: 0.5
Recall: 0.0430215166739

Why does this happen? It could be the result of a rushed update from the TF team to get v2.12 out before the holidays. It could also be that evaluating on an imbalanced dataset changes how the reported metrics behave.

I compared scikit-learn’s metrics with TensorFlow’s multiple times and noticed that scikit-learn was always closer to what I experienced through manual testing. The moral of the story: don’t trust a single library to do everything for you, stay flexible, and use other libraries to cross-check when needed.
