Training a Recurrent Neural Network to recognise sketches in a real-time game of Pictionary | by William Seymour

Google has made 50 million drawings from the Quick, Draw! game available to download. Of the several file formats on offer, I chose the custom .bin format for its smaller file size and efficient processing, though this necessitates a bespoke file parser. You can find all of Google’s documentation here.
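As a rough idea of what such a parser involves, here is a minimal sketch. The record layout (an 8-byte key, a 2-byte country code, a recognised flag, a 4-byte timestamp, a stroke count, then one byte per coordinate) is my assumption about the .bin format and should be checked against Google’s documentation. The function names are made up for illustration.

```python
import struct

def unpack_drawing(fh):
    """Read one drawing record from a binary file handle.

    The field layout here is an assumption; verify it against the docs.
    """
    key_id, = struct.unpack('<Q', fh.read(8))
    country_code, = struct.unpack('<2s', fh.read(2))
    recognized, = struct.unpack('<b', fh.read(1))
    timestamp, = struct.unpack('<I', fh.read(4))
    n_strokes, = struct.unpack('<H', fh.read(2))
    strokes = []
    for _ in range(n_strokes):
        n_points, = struct.unpack('<H', fh.read(2))
        fmt = f'<{n_points}B'  # one unsigned byte per coordinate
        xs = struct.unpack(fmt, fh.read(n_points))
        ys = struct.unpack(fmt, fh.read(n_points))
        strokes.append((xs, ys))
    return {
        'key_id': key_id,
        'country_code': country_code,
        'recognized': recognized,
        'timestamp': timestamp,
        'strokes': strokes,
    }

def unpack_drawings(fh):
    """Yield drawings until the file runs out of bytes."""
    while True:
        try:
            yield unpack_drawing(fh)
        except struct.error:  # raised when a read comes back short at EOF
            break
```

Opening a downloaded file with open(path, 'rb') and looping over unpack_drawings(fh) would then give one dictionary per drawing.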

Each drawing is an array of strokes, and the number of strokes varies from drawing to drawing. Each stroke also varies in length, and consists of two tuples, one of x coordinates and one of y coordinates, in the form:

((x1, x2, x3, …), (y1, y2, y3, …))

…such that the stroke can be plotted with (x1, y1), (x2, y2), (x3, y3), …
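In Python, pairing the two tuples up into plottable points is a one-liner with zip. A small illustration using one of the short strokes from the snail drawing:

```python
# One stroke in the ((x1, x2, ...), (y1, y2, ...)) form
stroke = ((205, 213, 215), (98, 78, 63))

# Pair each x with its matching y
points = list(zip(*stroke))
print(points)  # [(205, 98), (213, 78), (215, 63)]
```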

The raw data looks like this:

((157, 117, 51, 25, 17, 22, 27, 37, 40, 85, 148), (248, 250, 239, 226, 205, 184, 175, 169, 179, 194, 196))
((148, 148), (196, 196))
((114, 206, 215, 246, 255, 243, 214, 205, 195, 190, 104), (213, 183, 155, 127, 108, 93, 94, 99, 123, 126, 154))
((205, 213, 215), (98, 78, 63))
((121, 161, 174, 181, 179, 157, 132, 73, 38, 20, 8, 0, 0, 6, 19, 29, 35), (142, 106, 87, 65, 35, 12, 4, 0, 7, 16, 36, 75, 117, 137, 155, 165, 166))
((87, 88, 127, 128, 120, 101, 81, 63, 44, 36, 35, 44, 57, 76, 92, 103, 107, 104, 90, 85, 83, 85, 90, 101), (133, 124, 78, 56, 39, 29, 30, 35, 44, 72, 91, 107, 114, 115, 109, 99, 91, 65, 63, 66, 73, 78, 77, 66))

And when plotted, it looks like this:

A snail, with each stroke coloured separately

A few points about these points:

The y-axis is inverted, with (0, 0) being the top left corner. This is the convention in computer graphics, so we’ll have to make sure we observe this consistently.
Note that the drawing fills the canvas in the x-axis, and there is no padding on three sides. This is because the drawing has been rescaled to fill the frame, while maintaining aspect ratio.
The points are sparser than you would expect from a human hand, resulting in long straight lines between them. This is because the drawings have been simplified using the Ramer-Douglas-Peucker algorithm.

After training the model, we will have to recreate these steps for new drawings in order to make accurate predictions.
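A minimal sketch of how the rescaling and simplification might be recreated for a new drawing. This is not Google's actual pipeline: the function names are mine, and both the target size of 255 and the default epsilon of 2.0 are assumptions to check against Google's documentation.

```python
import math

def rescale_strokes(strokes, size=255):
    """Shift the drawing to the top-left corner and scale it uniformly
    so that its longer side spans 0..size (aspect ratio is preserved)."""
    all_x = [x for xs, _ in strokes for x in xs]
    all_y = [y for _, ys in strokes for y in ys]
    min_x, min_y = min(all_x), min(all_y)
    extent = max(max(all_x) - min_x, max(all_y) - min_y)
    scale = size / extent if extent > 0 else 1.0
    return [
        (tuple(round((x - min_x) * scale) for x in xs),
         tuple(round((y - min_y) * scale) for y in ys))
        for xs, ys in strokes
    ]

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # a and b coincide
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: drop points closer than epsilon to the
    straight line between the endpoints of each segment."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax > epsilon:
        # Keep the farthest point and simplify each half separately
        left = rdp(points[:idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return left[:-1] + right
    return [a, b]

def simplify_stroke(stroke, epsilon=2.0):
    """Apply RDP to one stroke in ((x...), (y...)) form."""
    kept = rdp(list(zip(*stroke)), epsilon)
    xs, ys = zip(*kept)
    return (xs, ys)
```

Rescaling first and simplifying second keeps epsilon meaningful, since it is then always measured in the same 0 to 255 pixel units.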

Classifying incomplete drawings

A design goal of this project is to have the AI guess in real-time. This means that we want to predict from the model after every stroke. This presents a couple of problems:

If the model has only ever seen finished drawings, it likely won’t predict well from partial drawings.
Incomplete drawings might be of very different scales, because the final drawings have been rescaled according to their final dimensions.

The solution to this is to include in the training set drawings at every stage of completion. If we treat each drawing-stage as a separate drawing, and rescale it accordingly, then we have a consistent mechanism for handling incomplete sketches. This should also help with overfitting, since:

Despite showing the model the same strokes multiple times, they may appear in different places and at different scales.
These incomplete drawings have high uncertainty associated with them, so also act like a form of noise.
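Generating the drawing-stages could look something like the sketch below. The name drawing_stages is hypothetical, and the rescale argument stands in for whatever rescaling routine is used on complete drawings.

```python
def drawing_stages(strokes, rescale=None):
    """Yield the drawing as it looked after 1, 2, ..., n strokes.

    Each stage is rescaled on its own, as if it were a finished drawing.
    """
    for k in range(1, len(strokes) + 1):
        stage = list(strokes[:k])
        yield rescale(stage) if rescale is not None else stage
```

A six-stroke drawing such as the snail would then contribute six training examples, all sharing the same label.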

However, a more complete solution would involve adding noise (in the form of random positional adjustments, translations, rotations, aspect ratio changes, etc.) to incomplete drawings to more fully address the potential for overfitting. Alternatively, we could include each drawing only once, picking a stage from it at random.
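The random-stage alternative is simple to sketch. The function name is hypothetical, and the chosen stage would still need rescaling afterwards.

```python
import random

def random_stage(strokes, rng=random):
    """Return the drawing truncated after a randomly chosen stroke."""
    k = rng.randint(1, len(strokes))  # inclusive on both ends
    return strokes[:k]
```

Passing a seeded random.Random instance as rng makes the selection reproducible between runs.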

For the snail above, the first strokes are unlikely to prove instructive. For the Eiffel Tower below, though, stroke 0 should be enough to make a good prediction.

Input data

One final layer of processing is required before the model can use this data.

The strokes are scaled to a 0–1 range. This helps training but also provides a consistent base for building applications.
The strokes are converted from points into deltas, that is, the change in x and y from the previous point.
For each point, a binary flag indicates whether the point is the start of a new stroke.
The data is padded with zeroes to a predefined length (in our case, a maximum of 200 points). This is because the neural network will expect inputs of consistent shape.
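The four steps above can be sketched in plain Python as follows. Several details are assumptions on my part rather than anything stated here: dividing by 255 to reach the 0–1 range, measuring the first delta from the origin, carrying deltas across stroke boundaries, and truncating drawings longer than the maximum.

```python
MAX_POINTS = 200

def to_model_input(strokes, max_points=MAX_POINTS, scale=255.0):
    """Convert strokes into max_points rows of [dx, dy, new_stroke_flag]."""
    rows = []
    prev_x, prev_y = 0.0, 0.0  # assumption: first delta is taken from (0, 0)
    for xs, ys in strokes:
        for i, (x, y) in enumerate(zip(xs, ys)):
            nx, ny = x / scale, y / scale       # scale to the 0-1 range
            flag = 1.0 if i == 0 else 0.0       # 1 marks the start of a stroke
            rows.append([nx - prev_x, ny - prev_y, flag])
            prev_x, prev_y = nx, ny
    rows = rows[:max_points]                    # assumption: truncate long drawings
    rows += [[0.0, 0.0, 0.0] for _ in range(max_points - len(rows))]
    return rows
```

The result is a list of 200 three-element rows, which numpy.asarray would turn into an array of shape (200, 3).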

All of this means that the final form for each drawing is an array of shape (200, 3), where each row comprises [x, y, z]: x and y are the deltas, with -1 ≤ x ≤ 1 and -1 ≤ y ≤ 1, and z is the new-stroke flag, either 0 or 1.

array([[ 0.18503937,  0.06692913,  1.        ],
[ 0.        ,  0.31496063,  0.        ],
[-0.04330709,  0.27165354,  0.        ],
[-0.03937008,  0.13385827,  0.        ],
[-0.1023622 ,  0.21259843,  0.        ],
[ 0.03937008, -0.02755906,  0.        ],
[ 0.0511811 , -0.01181102,  0.        ],
[ 0.09055118,  0.01181102,  0.        ],
[ 0.        , -0.0511811 ,  0.        ],
[ 0.06692913, -0.20472441,  0.        ],
[ 0.01968504, -0.00787402,  0.        ],
[ 0.14566929,  0.02362205,  0.        ],
...
