Last time, I downloaded a list of every baseball player who has ever played in the major leagues, split the names into separate lists of first and last names, and created a “vocabulary” of each character that appears in the names on the list, using one-hot encoding.

Now I was ready to create a neural network that I could train on my list and then use to generate baseball player names of its own.

Much of this code comes straight from Joel Grus’ fabulous book Data Science from Scratch. I updated his code with numpy arrays and refactored a lot of stuff for my specific purpose. I highly recommend his book for anyone wanting to learn data science. For those without quite so much time or inclination, here is a very thorough but comprehensible primer on exactly how neural networks work.

At the core of my network is the “Layer” class:

    from typing import List
    import numpy as np

    ##### Layers #####

    class Layer:
        def forward(self, inputs):
            raise NotImplementedError

        def backward(self, grad):
            raise NotImplementedError

        def params(self) -> List[np.array]:
            return ()

        def grads(self) -> List[np.array]:
            return ()

This is the parent class for all of my layers. Each layer type then defines its own class based on this class and inherits these four methods. The forward() method uses the layer’s params() to step forward through the layer, taking the inputs to this layer and calculating the outputs (which then become the inputs to the subsequent layer). The backward() method propagates the gradient of the loss backwards through the layer and stores the grads(), which the optimizer then uses to adjust the model’s parameters for the next pass through the training set.
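To make the interface concrete, here is a minimal sketch of a layer with no parameters at all. The `Tanh` class is hypothetical (it is not one of the layers used in this post) and the `Layer` base class is repeated only so the snippet runs on its own.

```python
from typing import List
import numpy as np

class Layer:
    """Same interface as the Layer base class above."""
    def forward(self, inputs):
        raise NotImplementedError
    def backward(self, grad):
        raise NotImplementedError
    def params(self) -> List[np.ndarray]:
        return []
    def grads(self) -> List[np.ndarray]:
        return []

class Tanh(Layer):
    """Hypothetical parameter-free layer: applies tanh elementwise."""
    def forward(self, inputs: np.ndarray) -> np.ndarray:
        self.out = np.tanh(inputs)  # cache the output for the backward pass
        return self.out

    def backward(self, grad: np.ndarray) -> np.ndarray:
        # d tanh(x)/dx = 1 - tanh(x)^2, applied via the chain rule
        return grad * (1 - self.out ** 2)

layer = Tanh()
out = layer.forward(np.array([0.0, 1.0]))
back = layer.backward(np.array([1.0, 1.0]))
```

Because `Tanh` has nothing to learn, it simply inherits the empty params() and grads() from the base class, and an optimizer looping over them has nothing to update.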

In principle, these functions and their associated parameters could be almost anything. My simplest type of layer is the Linear layer:

    class Linear(Layer):
        def __init__(self, input_dim: int, output_dim: int, init: str = 'xavier') -> None:
            self.input_dim = input_dim
            self.output_dim = output_dim
            self.w = np.array(random_tensor(output_dim, input_dim, init=init))
            self.b = np.array(random_tensor(output_dim, init=init))

        def forward(self, inputs: np.array) -> np.array:
            self.inputs = np.array(inputs)
            return np.dot(self.inputs, self.w.transpose()) + self.b

        def backward(self, grad: np.array) -> np.array:
            self.b_grad = np.array(grad)
            self.w_grad = np.outer(np.array(grad), self.inputs)
            return np.array([np.sum(np.dot(self.w.transpose()[i], self.b_grad)) for i in range(self.input_dim)])

        def params(self) -> List[np.array]:
            return [self.w, self.b]

        def grads(self) -> List[np.array]:
            return [self.w_grad, self.b_grad]

The linear layer has a weight matrix (w), with one row of weights per output neuron, and a bias vector (b), with one bias per output neuron. The output of the layer calculated by the forward() method is simple:

`Oᵢ = Σⱼ Wᵢⱼ * Iⱼ + Bᵢ`

In other words, the output of a given neuron is the weighted sum of all of the layer’s inputs plus that neuron’s bias. The number of inputs and outputs need not match: the final Linear layer in my network takes 32 inputs and produces 65 outputs.

Back propagation is a bit more complicated and involves some calculus. The goal is to update the weights and biases in the direction in which the loss function is most rapidly decreasing. It is left as an exercise to the reader.
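For readers who would rather check the result than derive it: if g is the gradient arriving from the next layer, the formulas used in Linear.backward() are ∂L/∂B = g, ∂L/∂W = the outer product of g and the inputs, and ∂L/∂I = Wᵀg. The sketch below compares the weight formula against a finite-difference estimate. The dimensions, the seed, and the helper `numeric_w_grad` are illustrative choices, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, output_dim = 4, 3
w = rng.normal(size=(output_dim, input_dim))
b = rng.normal(size=output_dim)
x = rng.normal(size=input_dim)
g = rng.normal(size=output_dim)  # stand-in for dLoss/dOutput from the next layer

def forward(w, b, x):
    # same computation as Linear.forward()
    return np.dot(x, w.transpose()) + b

# Analytic gradients, matching the formulas in Linear.backward()
b_grad = g
w_grad = np.outer(g, x)
x_grad = np.dot(w.transpose(), g)

def numeric_w_grad(i, j, eps=1e-6):
    """Central-difference estimate of d(g . output)/dw[i, j]."""
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i, j] += eps
    w_minus[i, j] -= eps
    return (np.dot(g, forward(w_plus, b, x)) - np.dot(g, forward(w_minus, b, x))) / (2 * eps)

# Largest disagreement between the two estimates over every weight
max_error = max(abs(numeric_w_grad(i, j) - w_grad[i, j])
                for i in range(output_dim) for j in range(input_dim))
```

If the analytic formulas are right, `max_error` should be tiny (limited only by floating-point round-off). The same kind of check can be run on any new layer before trusting it in training.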

The linear layer is initialized with random weights and biases. There are several ways to do this, but the one that I ended up using is Xavier initialization, which chooses random normal values centered around 0 whose scale is based on the number of input and output units of the layer.

    def random_tensor(*dims: int, init: str = 'normal', value: float = 0):
        if init == 'normal':
            return np.random.normal(size=dims)
        elif init == 'uniform':
            return np.random.uniform(size=dims)
        elif init == 'xavier':
            variance = len(dims) / sum(dims)
            # Note: numpy's scale argument is a standard deviation,
            # so this value is used directly as the spread of the draws
            return np.random.normal(scale=variance, size=dims)
        else:
            raise ValueError(f"unknown init: {init}")

The other type of layer that I use is a very simple Recurrent Neural Network (RNN) layer.

    class SimpleRnn(Layer):
        def __init__(self, input_dim: int, hidden_dim: int) -> None:
            self.input_dim = input_dim
            self.hidden_dim = hidden_dim
            self.w = np.array(random_tensor(hidden_dim, input_dim, init='xavier'))
            self.u = np.array(random_tensor(hidden_dim, hidden_dim, init='xavier'))
            self.b = np.array(random_tensor(hidden_dim))

            self.reset_hidden_state()

        def reset_hidden_state(self) -> None:
            self.hidden = np.zeros(self.hidden_dim)

        def forward(self, inputs: np.array) -> np.array:
            self.inputs = inputs
            self.prev_hidden = self.hidden
            self.hidden = np.tanh(np.dot(self.w, self.inputs) + np.dot(self.u, self.hidden) + self.b)
            return self.hidden

        def backward(self, grad: np.array) -> np.array:
            self.b_grad = grad * (1 - self.hidden ** 2)
            self.w_grad = np.outer(self.b_grad, self.inputs)
            self.u_grad = np.outer(self.b_grad, self.prev_hidden)
            return np.array([np.sum(np.dot(self.w.transpose()[i], self.b_grad)) for i in range(self.input_dim)])

        def params(self) -> List[np.array]:
            return [self.w, self.u, self.b]

        def grads(self) -> List[np.array]:
            return [self.w_grad, self.u_grad, self.b_grad]

The goal of using an RNN is for the network to develop a sort of “memory” of what it has seen in previous iterations through the network. Without some sort of RNN, the network would choose each letter solely based on the letter immediately preceding it. It might decide that “e” is followed by “s” in most cases. But imagine we had as our input the string “Frankenste — ”. Our training data probably contains a lot of last names that end in “stein”. So we would want the network to recognize that in this context the most likely next letter would be “i”, followed by “n”, for “Frankenstein”. (It might secondarily choose “rn” if there are a lot of “stern” names in the training set).

My RNN, by keeping a running “memory” of the letters it has seen, would be able to figure this out. Last names ending in “so — ” would get an “n”. Names ending in “ma — ” would probably get an “n” or “nn”. Names ending in “sk — ” would likely get an “i”. Etc.

The details of how the network accomplishes this are not terribly complicated but are beyond the scope of this post. I highly recommend Grus’s book for a more in-depth analysis. The basic premise is that the RNN contains a “hidden” layer, and remembers the state of that layer from one letter to the next.
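The “memory” idea can be seen in a few lines. The sketch below uses the same update rule as SimpleRnn.forward() on a toy three-letter alphabet with made-up random weights (the sizes, seed, and the helper `run` are assumptions for illustration only). It feeds in two sequences that end on the same letter and compares the resulting hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
w = 0.5 * rng.normal(size=(hidden_dim, input_dim))
u = 0.5 * rng.normal(size=(hidden_dim, hidden_dim))
b = 0.5 * rng.normal(size=hidden_dim)
one_hot = np.eye(input_dim)  # row k is the one-hot vector for letter k

def step(x, hidden):
    # same update as SimpleRnn.forward()
    return np.tanh(np.dot(w, x) + np.dot(u, hidden) + b)

def run(sequence):
    """Feed a sequence of letter indices through the RNN from a blank state."""
    hidden = np.zeros(hidden_dim)
    for idx in sequence:
        hidden = step(one_hot[idx], hidden)
    return hidden

h_after_01 = run([0, 1])  # letter 0, then letter 1
h_after_21 = run([2, 1])  # letter 2, then letter 1
```

Both sequences end on letter 1, yet the two hidden states differ, because each one still carries a trace of the letter that came before. That difference is what lets later layers treat “e” after “Frankenst” differently from “e” elsewhere.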

The next thing I needed to decide was what loss function to optimize. The final output of my layers is a vector that is the length of my vocabulary (which contains 65 characters). For a typical step forward through the layer, it might look like:

    [-4.71046723  4.77937911  3.6164262   5.6784715   5.27758362  7.14386971
      7.10323639  3.71422599  7.06697032  7.13644326  6.49689206  4.90589255
      3.40343896  3.9638191   5.02996235  9.7853716   3.76766735  3.04525307
      4.31365311  5.04865564  3.54316214  6.38819682  2.22243693  3.1726885
      0.57863287 -0.5414254   1.96992059  1.48973753 -3.93000269  0.77282324
     -3.16443625 -0.095528   -2.27799066 -0.25839462 -2.43346072 -4.25403946
     -5.85953502 -2.62376915 -3.48366961  0.24044162 -3.53302845 -3.47371261
     -4.10215468 -3.34675151 -2.4199506  -3.36795306 -1.91529773 -1.93689667
     -4.19529354 -4.01643504 -2.88543491 -5.48029582 -4.84205942 -3.28526568
     -5.29518011 -5.95704384 -5.86119124 -4.37885162 -3.7478578  -4.0281952
     -3.98126127 -3.96792545 -4.08770416 -3.82748496  5.9761519 ]

What I ultimately want, however, is a vector of *probabilities* of how likely each particular letter is. Enter the “softmax” function:

`softmax(y)ᵢ = exp(yᵢ) / Σⱼ exp(yⱼ)`

Which I implement as:

    def softmax(y: np.array) -> np.array:
        e_y = np.exp(y - np.max(y))
        return e_y / e_y.sum(axis=0)

This gives me a normalized vector predicting the probability of the next letter:

    [3.49203570e-07 4.61813449e-03 1.44345266e-03 1.13484733e-02
     7.60035793e-03 4.91305691e-02 4.71742462e-02 1.59175584e-03
     4.54940727e-02 4.87670548e-02 2.57260420e-02 5.24095782e-03
     1.16655001e-03 2.04302350e-03 5.93326155e-03 6.89511466e-01
     1.67913550e-03 8.15352126e-04 2.89870716e-03 6.04521691e-03
     1.34148050e-03 2.30763560e-02 3.58097012e-04 9.26167872e-04
     6.91999315e-05 2.25772238e-05 2.78185343e-04 1.72104961e-04
     7.62131863e-07 8.40313752e-05 1.63874508e-06 3.52631673e-05
     3.97639784e-06 2.99632725e-05 3.40384706e-06 5.51191788e-07
     1.10673814e-07 2.81397437e-06 1.19088579e-06 4.93436274e-05
     1.13353214e-06 1.20280267e-06 6.41601633e-07 1.36562950e-06
     3.45014550e-06 1.33698081e-06 5.71485697e-06 5.59274561e-06
     5.84542063e-07 6.99025514e-07 2.16610818e-06 1.61713535e-07
     3.06146228e-07 1.45223147e-06 1.94599223e-07 1.00391590e-07
     1.10490666e-07 4.86516424e-07 9.14396676e-07 6.90853013e-07
     7.24050403e-07 7.33770885e-07 6.50940421e-07 8.44409291e-07
     1.52833443e-02]

In this format, it’s easy to see which letter the network thinks should come next: the one with the largest probability (here the 16th entry, at about 0.69).
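Reading off the most likely letter is a one-liner with np.argmax. The sketch below uses a made-up three-character vocabulary (`chars` and `logits` are illustrative, not taken from the trained network) and also shows why subtracting the maximum inside softmax is safe: shifting every input by the same constant leaves the probabilities unchanged, while keeping the exponentials from overflowing.

```python
import numpy as np

def softmax(y):
    e_y = np.exp(y - np.max(y))  # subtracting the max avoids overflow
    return e_y / e_y.sum(axis=0)

chars = ["a", "b", "c"]            # hypothetical tiny vocabulary
logits = np.array([1.0, 3.0, 0.5])  # made-up raw network outputs

probs = softmax(logits)
best = chars[int(np.argmax(probs))]  # most likely next character

# Adding a constant to every logit does not change the result
shifted = softmax(logits + 100.0)
```

Note that the generator later in this post does not simply take the most likely letter every time. It samples from the probabilities instead, which is what keeps it from producing the same name over and over.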

From this vector, I can determine the “cross entropy” loss function:

    class SoftMaxCrossEntropy(Loss):
        def loss(self, predicted: np.array, actual: np.array) -> float:
            return -np.sum(np.log(softmax(predicted) + 1e-30) * actual)

        def gradient(self, predicted: np.array, actual: np.array) -> np.array:
            return softmax(predicted) - actual

This function allows the network to determine in which direction (in 65-dimensional vector space) it should step in order to get closer to the correct answer for a given input and target. Instead of a simple gradient descent, I used an optimizer with momentum, which keeps a running average of the previous gradients so that it doesn’t overreact, especially at the start of the training, where it can swing wildly.

    class Momentum(Optimizer):
        def __init__(self, learning_rate: float, momentum: float = 0.9) -> None:
            self.lr = learning_rate
            self.mo = momentum
            self.updates = []

        def step(self, layer: Layer) -> None:
            if not self.updates:
                self.updates = [np.zeros_like(grad) for grad in layer.grads()]

            for update, param, grad in zip(self.updates, layer.params(), layer.grads()):
                update[:] = self.mo * update + (1 - self.mo) * grad
                param[:] = param - update * self.lr
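To see what the running average buys, here is a small sketch that applies the same update rule as Momentum.step() to plain numbers instead of arrays. The helper `momentum_updates` and the gradient sequences are made up for illustration.

```python
def momentum_updates(grads, mo=0.9):
    """Apply update = mo * update + (1 - mo) * grad to a sequence of scalars."""
    update = 0.0
    history = []
    for g in grads:
        update = mo * update + (1 - mo) * g
        history.append(update)
    return history

# A gradient that flips sign every step, as can happen early in training
noisy = [1.0, -1.0, 1.0, -1.0]
smoothed = momentum_updates(noisy)

# A gradient that consistently points the same way
steady = momentum_updates([1.0] * 100)
```

With the flipping gradient, the smoothed updates never exceed 0.1 in size, a tenth of the raw gradient, so the parameters barely move. With the steady gradient, the update climbs toward the full gradient value of 1.0. In other words, momentum damps directions the gradients disagree about and keeps the ones they agree on.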

With all of this in place, I could now create my network. First, I created a Model class, which is a kind of super-Layer which contains a list of the layers in my particular network, as well as the loss function, the optimizer, the weights and gradients for each layer, and the instructions for stepping forwards and backwards through the entire model.

    class Model(Layer):
        def __init__(self,
                     layers: List[Layer],
                     loss: Loss,
                     optimizer: Optimizer,
                     ) -> None:
            self.layers = layers
            self.loss = loss
            self.optimizer = optimizer

        def forward(self, inputs):
            for layer in self.layers:
                inputs = layer.forward(inputs)
            return inputs

        def backward(self, grad):
            for layer in reversed(self.layers):
                grad = layer.backward(grad)
            return grad

        def params(self) -> List[np.array]:
            return (param for layer in self.layers for param in layer.params())

        def grads(self) -> List[np.array]:
            return (grad for layer in self.layers for grad in layer.grads())

My model contains three layers: two RNN layers followed by a Linear layer:

    def create_model(vocab, HIDDEN_DIM = 32):
        # Set up neural network; HIDDEN_DIM comes from the argument
        # (it is no longer overwritten inside the function)
        rnn1 = SimpleRnn(input_dim=vocab.size, hidden_dim=HIDDEN_DIM)
        rnn2 = SimpleRnn(input_dim=HIDDEN_DIM, hidden_dim=HIDDEN_DIM)
        linear = Linear(input_dim=HIDDEN_DIM, output_dim=vocab.size)

        loss = SoftMaxCrossEntropy()
        optimizer = Momentum(learning_rate=0.01, momentum=.9)
        model = Model([rnn1, rnn2, linear], loss, optimizer)

        return model
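As a rough sense of scale, the number of trainable values in this network can be counted directly from the shapes of w, u, and b in each layer. The sketch below assumes the 65-character vocabulary and the default hidden size of 32 mentioned in this post; the two helper functions are illustrative and not part of the original code.

```python
def rnn_param_count(input_dim: int, hidden_dim: int) -> int:
    # w is (hidden x input), u is (hidden x hidden), b is (hidden)
    return hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim

def linear_param_count(input_dim: int, output_dim: int) -> int:
    # w is (output x input), b is (output)
    return output_dim * input_dim + output_dim

vocab_size, hidden_dim = 65, 32

rnn1_count = rnn_param_count(vocab_size, hidden_dim)       # 3136
rnn2_count = rnn_param_count(hidden_dim, hidden_dim)       # 2080
linear_count = linear_param_count(hidden_dim, vocab_size)  # 2145
total = rnn1_count + rnn2_count + linear_count             # 7361
```

A little over seven thousand parameters is tiny by modern standards, which is part of why a pure numpy implementation can train it at all.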

I was now ready to train the model!

    import random
    import tqdm

    def train(model: Model,
              names: List,
              batchsize: int,
              n_epochs: int,
              weightfile,
              vocab: Vocabulary):
        for epoch in range(n_epochs):
            # Train on a fresh random subset of the names each epoch
            random.shuffle(names)
            batch = names[:batchsize]
            epoch_loss = 0
            for name in tqdm.tqdm(batch):
                # Start each name with a blank memory
                model.layers[0].reset_hidden_state()
                model.layers[1].reset_hidden_state()
                name = START + name + STOP
                for prev, nexts in zip(name, name[1:]):
                    inputs = vocab.one_hot_encode(prev)
                    targets = vocab.one_hot_encode(nexts)
                    predicted = model.forward(inputs)
                    epoch_loss += model.loss.loss(predicted, targets)
                    gradient = model.loss.gradient(predicted, targets)
                    model.backward(gradient)
                    model.optimizer.step(model)
            print(epoch, epoch_loss, generate(model, vocab))
            save_weights(model, weightfile)

I make use of the tqdm library, which lets me track the network’s progress with a progress bar. For each epoch, I shuffle the list of names and choose a subset of batchsize names to train the network on. In theory, this lets each epoch finish more quickly, since the network does not need to train on every single name in every epoch. For the results in practice, see part 4 of this series.

For each name, the network resets the hidden states of the RNN layers and adds the START and STOP characters (see last post for explanation). For each letter in the name, the input is that letter and the target is the following letter. (Actually, since the hidden state is not reset *between* each step through the network, the actual input is every letter up to and including the current letter, all of which information is necessary to predict the next letter. That’s what makes it “recurrent”).
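The input and target pairs for a single name can be sketched as follows. The actual START and STOP characters are defined in the previous post and are not shown here, so the values `"^"` and `"$"` below are placeholders, and `training_pairs` is a hypothetical helper that mirrors the zip() call in train().

```python
START = "^"  # placeholder; the real START character is defined in the previous post
STOP = "$"   # placeholder; likewise for STOP

def training_pairs(name: str):
    """Return (input letter, target letter) pairs for one name."""
    padded = START + name + STOP
    return list(zip(padded, padded[1:]))

pairs = training_pairs("Ruth")
# [('^', 'R'), ('R', 'u'), ('u', 't'), ('t', 'h'), ('h', '$')]
```

The first pair teaches the network which letters tend to start a name, and the last pair teaches it when a name should end.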

The model steps forward, starting with the input letter. It sees how close the output of the network is to the target letter, then steps backward, calculating the gradients, which it uses to update the weights to get a bit closer to the target next time. I output the total loss for each epoch so I can see if it is going up or (ideally) down and how fast it is changing. I also output a sample name, so I can get a subjective flavor of how well the network is doing.

At first, these names were garbage such as “petolbimtictBeo” and “cM”. After a few epochs, they got a bit more coherent, if not much more real sounding: “Mebmtetn”, “Hoehos”. Eventually, they started to resemble real last names: “Wason”, “Maicher”, “Tealman”, “Da Lass”, “Fiszagson”.

The names are generated by a forward pass through the network:

    def generate(model: Model,
                 vocab: Vocabulary,
                 seed_char: str = START,
                 max_len: int = 160) -> str:
        model.layers[0].reset_hidden_state()
        model.layers[1].reset_hidden_state()
        output = [seed_char]

        while output[-1] != STOP and len(output) < max_len:
            this_input = vocab.one_hot_encode(output[-1])
            predicted = model.forward(this_input)
            probabilities = softmax(predicted)
            next_char_id = sample_from(probabilities)
            output.append(vocab.get_word(next_char_id))

        return ''.join(output[1:-1])

This function resets the hidden states, starts with my START character, and predicts the next character. Then, *without* resetting the hidden state, we predict the following character, and the one following that, etc. It keeps going until it either predicts the STOP character or reaches a maximum length. In practice, the latter never happens once the network is trained.
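The sample_from() helper is not shown in this post. One possible implementation, assuming it takes a probability vector and returns the index of a randomly chosen entry, is sketched below using numpy’s random generator. The optional `rng` argument is an addition of mine for reproducibility, not part of the original signature.

```python
import numpy as np

def sample_from(probabilities, rng=None) -> int:
    """Pick an index at random, weighted by the given probabilities.

    A possible stand-in for the helper used in generate(); the real one may differ.
    """
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probabilities), p=probabilities))

# When one entry has all of the probability, it is always chosen
certain = sample_from(np.array([0.0, 1.0, 0.0]))

# Otherwise the picks follow the weights on average
rng = np.random.default_rng(0)
draws = [sample_from(np.array([0.9, 0.1]), rng) for _ in range(1000)]
share_of_zero = draws.count(0) / len(draws)
```

Sampling rather than always taking the most likely letter is what gives the generator variety: likely letters are chosen most of the time, but less likely ones still get an occasional turn.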

I repeated this entire process with first names, training a new network from my list of first names and then generating names based on those weights. Finally, I picked a suffix at random from my list of suffixes:

    def random_suffix() -> str:
        suffix = random.choice(suffixes)
        return suffix if suffix is not None else ""

I then combined it all into one big list:

    for i in range(100):
        print(generated_first_names[i], generated_last_names[i], random_suffix())

And so I have a baseball roster for a non-existent team!

Kiny Pest

Edwin Marke

Bob Crack

Mickes Katt

Brord Heckeis

Ad Crorthwer

Wilzan Chantir

Man Wueno

Conn Cillrer

Sonny Couezt

Chris Carron

Bron Wassard

Tordects Purro

Donny Trawstos

Wank Jaresel

Donus Kur

Fred Griffe

Ken Frown

Auman Mittk

Willes Mires

Juas Gownman

Roy Atte

Felbon O'Maa

Frorie Brustuud

Peter Zolsa

Henn Lavartield

James Vann

Reggan Enton

Fory Schtirz

Mike Zarthishell

Don Ryannimon

Millie Ladell

Denn Bary

Gene Kay Yessatchy

Oker Cwith

Ralph Neoncher

Doug Fidson

Reg Brykeran

Jack Allen

Larry Diller

Pete Walkitter

Jim Kerra

Chris Wiggkowieldo

Ryan Mackerring

Stan Sh

Rou Tretus

Jeff Holt

Tom Alberezers

Fred Renzorech

Endiel Ottle

Toss Hill

Ramón Haly

Roy Haden

Jim McGadrar

Howard Willer

Floy Dujlin

Nebbie Santers

Charlie Jamer

Doug Kelhantz

Nic Willing

Hect Cocker

John Ruerickney

Hen Ry

Eusas Tatuin

Rert Balleallolling

Isan Bradnien

Rick Thombau

Tompend Ruletlin

Shaur Cosper

Tom Brither

Mickett Colmandez

Bill Distid

Kindy Mincan

Carl Jall

George McCarriffid

Ray McGourkettree

Mike Dianton

Walt Jammer

Billy Moranay

Dave Fiten

Rube Froqfieldeuez

Luis Warrin

Juniel Hankre

Vince Yohsty

Felix Renson

Ster Morristjiney

Millie Edwut

Larry Pantincyrond

Mike Jarth

Jyan Qoud

Tunmel Fribitistio

Craig Fitz

Anth Kíener

John Nahrin

Mike Maravantz

Meas Piesender

Will Mendnenst

Ernie Cirffardart

Bob Fardon

Robin Krich

In the next installment in this series, I redo all of this the “proper” way, using the TensorFlow package Keras. And then in a pair of wrap-ups, I compare the two methods on various measures of speed and accuracy.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn
