A very gentle introduction to real-life AI: Part I — Simple, custom dataset project | by Slawomir Telega | Jan, 2024


If one considers working in the field of Artificial Intelligence, the Keras library is among the most frequent choices. That's no surprise, as its learning curve is not very steep, especially when using the Sequential model. It's like laying blocks with a very straightforward manual: one stacks layer on top of another. Simple as that. Working with datasets is also not particularly complicated at that stage, since one can use built-in functions to generate datasets from user-supplied data. In short, it's really simple to get a 'dog vs. cat' classifier up and running. And that's great: most people will stop here and be content.
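To make the block-stacking point concrete, here is a minimal Sequential sketch (the layer sizes are arbitrary and purely illustrative, not part of this tutorial's code):

# a minimal Sequential model: layers simply stacked on top of each other
from tensorflow import keras
from keras import layers

seq_model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(3, activation='softmax'),
])
seq_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')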

But once one passes the stage of learning from tutorials and tries to work on one's own projects, it turns out that the world out there is more complicated and requires much more if you intend to do something that isn't out of the box. The Sequential model is not an option in the case of multiple inputs and outputs, custom data requires subclassing Keras classes, fancy architectures require writing your own training loops, and so on. I find this stage crucial, as many wannabe AI developers will give up at this point: it's not very effective or rewarding, and it's more about wrangling with data and OOP than building cool neural networks. Therefore, I've decided to put together a series of tutorials. By no means is what I show the only, or the best, way to do it; it is merely the way I do it. That said, let's get our hands dirty.

Though real-life data usually runs to tens of thousands of rows or more, for the sake of illustration I'll use an extremely small, yet well-known dataset: the famous Iris dataset, consisting of 150 rows of data with 4 numerical input features and one categorical output (3 classes). The size of the set makes it practically pointless to use neural networks, but it allows one to easily follow and check all the calculations.

First, one has to read the csv file and transform it:

# All the needed imports
import math  # used for math.ceil in the Sequence subclass below

import tensorflow as tf
from tensorflow import keras
from keras import layers
import numpy as np
import pandas as pd

# Open csv file
train_data = pd.read_csv("Iris.csv")
df = train_data.copy()

As can be seen, the Species column contains non-numerical data, which should be converted before proceeding further. The easiest way is to convert it into integer values, as shown below:

def convert_df(df, col):
    # get all the unique column values
    u_vals = df[col].unique()

    # create a dictionary mapping each unique value to an integer
    D = dict((u_val, i) for i, u_val in enumerate(u_vals))

    # and map species' names to integers
    df[col] = df[col].map(D)
    return df
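As a side note, pandas can achieve essentially the same integer mapping in one line with factorize; the explicit function above is kept for transparency, but the shortcut would be:

# equivalent one-liner (alternative to convert_df)
df['Species'] = pd.factorize(df['Species'])[0]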

Now one has everything needed to finalize the input data preparation:

# coding 'Species' as integers
df = convert_df(df, 'Species')

# dropping 'Id' column (useless)
df.drop('Id', axis='columns', inplace=True)

# converting to numpy array
xy = np.array(df)

# and finally shuffling and dividing into training and evaluation part
np.random.shuffle(xy)

xy_train = xy[:125,]
xy_val = xy[125:,]
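A quick sanity check of the split: with 150 rows, 4 features and 1 label column, the expected shapes are easy to verify:

# 125 rows for training, 25 for validation, 5 columns each (4 features + label)
print(xy_train.shape, xy_val.shape)  # (125, 5) (25, 5)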

The thing that struck me most about the neural networks learning curve is that, no matter what environment one settles for, there is a steep section when moving from tutorials to real-life scenarios. It is my impression that no matter how many tutorials you work through, you won't be ready for what's to come. As is often stressed, Deep Learning is 90% data wrangling and 10% cool stuff (at most :P). And usually, automated solutions work great in tutorials, but once you start working on your own projects they just don't do what you'd like them to do. The solution is to leave the safe elegance of built-in functions and dive into the murky world of subclassing. You lose the simplicity, but gain almost unlimited control over the code.

In Keras, the simplest way to implement such a solution is to subclass the Sequence class (keras.utils.Sequence). Complicated as it sounds, it turns out to be quite straightforward (at least at the basic level). One just has to override some of the class's methods to one's liking: __getitem__ (returns one batch), __len__ (the number of batches per epoch) and, optionally, on_epoch_end. So, long story short, here it goes:

class IrisDataset(tf.keras.utils.Sequence):

    # init all the inputs and dims
    def __init__(self, data, batch_size=5):
        self.data = data
        self.batch_size = batch_size
        self.n_rows = data.shape[0]  # number of rows in the input data

    # method gets data from the input array, transforms it and returns
    # batches of data
    def __getitem__(self, ind):
        # get rows from the input data:
        # ind - number of the batch (of size = batch_size)
        pos = ind * self.batch_size
        dummy = self.data[pos:pos + self.batch_size, :]

        # separate the columns into X and y
        # (features and labels respectively)
        X = dummy[:, :4]
        y = dummy[:, 4]

        # one-hot encode the Species column
        Y = self.__convert_to_ohe__(y)
        return X, Y

    def __len__(self):
        # number of possible batches
        return math.ceil(self.n_rows / self.batch_size)

    # fires at every epoch end
    def on_epoch_end(self):
        # shuffle input data rows every epoch
        np.random.shuffle(self.data)

    # method converts integers to
    # one-hot encoded vectors of length 3 (number of classes)
    def __convert_to_ohe__(self, Y):
        # create an array of zeros
        output = np.zeros((Y.shape[0], 3))

        # loop over data rows
        for i, val in enumerate(Y):
            # for a given row, the class value tells which vector coordinate
            # should be '1'
            output[i, int(val)] = 1
        return output

There, that wasn't so hard. Dataset creation is now pretty trivial: one just initializes IrisDataset with the appropriate input array:

ds_train = IrisDataset(xy_train)
ds_val = IrisDataset(xy_val)
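A quick way to verify that the generator behaves as expected is to pull a single batch and inspect its shapes (with the default batch_size=5, X holds 5 feature rows and Y the corresponding one-hot labels):

# fetch the first batch and check its dimensions
X, Y = ds_train[0]
print(X.shape, Y.shape)  # (5, 4) and (5, 3)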

The rest is straightforward, and probably doesn’t have to be explained:

# function creates a simple fully connected (dense) model
def get_model():
    inputs = keras.Input(shape=(4,))
    x = layers.Dense(16, activation='relu')(inputs)
    outputs = layers.Dense(3, activation='softmax')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    return model

# get the model
model = get_model()
# and fit it
history = model.fit(ds_train, validation_data=ds_val, epochs=100)
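Once trained, the same Sequence objects can be reused for a quick check (this snippet is an addition to the original walkthrough; since the model was compiled with a loss only, evaluate returns just that loss value):

# validation loss after training
val_loss = model.evaluate(ds_val)
print("validation loss:", val_loss)

# predicted class indices for the validation set
preds = model.predict(ds_val)
print(preds.argmax(axis=1))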

Summing up, the problem itself should really be solved using completely different methods (classical ML, e.g. SVM or K-means), as due to the small dataset size our model overfits almost from the start. But the purpose of the tutorial above was to show and explain how to create a simple, custom dataset. I hope it will be helpful. If you feel that you need more, there is an excellent article by Arjun Muraleedharan, which I wish I had discovered before learning the hard way by myself. It's really worth reading.
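For reference, here is a minimal sketch of the classical-ML route mentioned above, using scikit-learn's SVC on the same numpy arrays (scikit-learn is not part of the original code, so treat it purely as an illustration):

# a support vector classifier fitted on the same training split
from sklearn.svm import SVC

svm = SVC()
svm.fit(xy_train[:, :4], xy_train[:, 4])

# mean accuracy on the held-out validation rows
print(svm.score(xy_val[:, :4], xy_val[:, 4]))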

I’d love to hear from you, if you have any comments or suggestions. Please do not hesitate to write.

Cheers, S.

Below I attach the code in one piece:

###
# IMPORTS
###

import math  # used for math.ceil in the Sequence subclass

import tensorflow as tf
from tensorflow import keras
from keras import layers
import numpy as np
import pandas as pd

###
# SUBCLASSING
###

class IrisDataset(tf.keras.utils.Sequence):

    # init all the inputs and dims
    def __init__(self, data, batch_size=5):
        self.data = data
        self.batch_size = batch_size
        self.n_rows = data.shape[0]  # number of rows in the input data

    # method gets data from the input array, transforms it and returns
    # batches of data
    def __getitem__(self, ind):
        # get rows from the input data:
        # ind - number of the batch (of size = batch_size)
        pos = ind * self.batch_size
        dummy = self.data[pos:pos + self.batch_size, :]

        # separate the columns into X and y
        # (features and labels respectively)
        X = dummy[:, :4]
        y = dummy[:, 4]

        # one-hot encode the Species column
        Y = self.__convert_to_ohe__(y)
        return X, Y

    def __len__(self):
        # number of possible batches
        return math.ceil(self.n_rows / self.batch_size)

    # fires at every epoch end
    def on_epoch_end(self):
        # shuffle input data rows every epoch
        np.random.shuffle(self.data)

    # method converts integers to
    # one-hot encoded vectors of length 3 (number of classes)
    def __convert_to_ohe__(self, Y):
        # create an array of zeros
        output = np.zeros((Y.shape[0], 3))

        # loop over data rows
        for i, val in enumerate(Y):
            # for a given row, the class value tells which vector coordinate
            # should be '1'
            output[i, int(val)] = 1
        return output

###
# FUNCTIONS
###

def convert_df(df, col):
    # get all the unique column values
    u_vals = df[col].unique()

    # create a dictionary mapping each unique value to an integer
    D = dict((u_val, i) for i, u_val in enumerate(u_vals))

    # and map species' names to integers
    df[col] = df[col].map(D)
    return df

# function creates a simple fully connected (dense) model
def get_model():
    inputs = keras.Input(shape=(4,))
    x = layers.Dense(16, activation='relu')(inputs)
    outputs = layers.Dense(3, activation='softmax')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    return model

###
# CODE
###

# Open csv file
train_data = pd.read_csv("Iris.csv")
df = train_data.copy()
df.head()

# coding 'Species' as integers
df = convert_df(df, 'Species')

# dropping 'Id' column (useless)
df.drop('Id', axis='columns', inplace=True)

# converting to numpy array
xy = np.array(df)

# shuffling and dividing into training and evaluation part
np.random.shuffle(xy)

xy_train = xy[:125,]
xy_val = xy[125:,]

# creating datasets
ds_train = IrisDataset(xy_train)
ds_val = IrisDataset(xy_val)

# get the model
model = get_model()
# and fit it
history = model.fit(ds_train, validation_data=ds_val, epochs=100)

Fisher, R. A. (1988). Iris. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76

Muraleedharan, A. (2021). "Write your own Custom Data Generator for TensorFlow Keras". https://medium.com/analytics-vidhya/write-your-own-custom-data-generator-for-tensorflow-keras-1252b64e41c3


