Image Recognition and Synthetic Data Generation using Houdini, Python and TensorFlow (Part 2) | by ironscavenger | May, 2024


LEGO Wheelloader built in Cinema 4D, rendered with Redshift

Overview

As a follow-up to my last blog post (Link), this article dives into more detail about synthetic data generation, featuring my latest progress on the topic.
If you’re new to the story, here’s a quick outline of the project:

  • The project goal is to create a system capable of processing LEGO building instructions from PDF and creating a 3D-model from it.
  • Image processing is done using a neural network built with Python/Tensorflow.
  • To train the model and make it capable of inferring 3D brick information from 2D images, synthetic data made with SideFX Houdini is used.

Key learnings

Some things I learned following my last article:

  • More image variance (i.e. data augmentation) is needed. My original dataset was far too clean, which led to overfitting (see the sketch after this list).
  • Lower resolution images: Reducing image sizes from 512×512 to 128×128 significantly improved generalization of the initial model.
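For illustration, here's a minimal sketch of what such an augmentation and downscaling step could look like in a Keras preprocessing pipeline. The specific layers and parameter values are assumptions for demonstration, not the exact setup used in this project.

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation/downscaling pipeline (layer choices and parameter
# values are assumptions, not the project's actual configuration)
augment = tf.keras.Sequential([
    layers.Resizing(128, 128),             # downscale e.g. from 512x512 to 128x128
    layers.RandomRotation(0.05),           # small random rotations
    layers.RandomTranslation(0.05, 0.05),  # slight shifts
    layers.RandomZoom(0.1),                # mild zoom in/out
    layers.RandomContrast(0.2),            # vary contrast
])

# Example usage on a batch of images during training:
# images = augment(images, training=True)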

The model

The final model architecture will very likely be a mixed-input hybrid neural network (also called an HDNN).
This means the model is capable of receiving multiple different inputs at once and inferring the desired output from them.
In practice, the input will be two images and a JSON file (a rough sketch of such a model follows below).

Potential model architecture
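For illustration, a mixed-input model of this kind could be wired up with the Keras functional API as sketched below. The input shapes, layer sizes and the simple CNN branch are placeholder assumptions; as described later, the image branches will likely be replaced by pre-trained detection or segmentation models.

from tensorflow.keras import layers, Model

# Two image inputs (consecutive building steps) plus a metadata vector parsed
# from the JSON file. All shapes and layer sizes are illustrative assumptions.
img_a = layers.Input(shape=(128, 128, 1), name="step_image_a")
img_b = layers.Input(shape=(128, 128, 1), name="step_image_b")
meta = layers.Input(shape=(64,), name="brick_metadata")

def image_branch(x):
    # Placeholder CNN branch; in the real model this would be a pre-trained
    # detector/segmenter (e.g. YOLO or Mask R-CNN) feeding its outputs onward
    x = layers.Conv2D(16, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return x

# Simple multilayer perceptron for the JSON metadata
meta_branch = layers.Dense(32, activation="relu")(meta)

# Concatenate all branches and infer the newly added brick
merged = layers.Concatenate()([image_branch(img_a), image_branch(img_b), meta_branch])
merged = layers.Dense(128, activation="relu")(merged)
output = layers.Dense(7, name="new_brick_prediction")(merged)  # e.g. type + position + orientation

model = Model(inputs=[img_a, img_b, meta], outputs=output)
model.summary()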

The data

As mentioned above, the model's input will be two images and a JSON file. This will enable the model to understand the differences between two building steps and infer 3D information from them.

  • The images represent 2 consecutive building steps from the building-instructions:
Two consecutive building steps: The 1×2 brick was added on top.
  • The JSON file contains information from the first image, like the types of bricks in the image and their position and orientation in 3D space (a small parsing sketch follows after the example):
{
  "brick0": {
    "brickID": 0,
    "bricktype": 1,
    "position": [0, 0, 0],
    "orientation": [0, 0, 0]
  },
  "brick1": {
    "brickID": 1,
    "bricktype": 0,
    "position": [2, 0, 1],
    "orientation": [0, 0, 0]
  },
  "brick2": {
    "brickID": 2,
    "bricktype": 1,
    "position": [1, 1, -1],
    "orientation": [0, 90, 0]
  }
}
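To feed this metadata into a neural network, it needs to be turned into a fixed-length numeric vector. The following Python snippet is a minimal sketch of that idea; the key names follow the example above, while the helper name and the assumed maximum brick count are purely illustrative.

import json
import numpy as np

MAX_BRICKS = 8  # assumed maximum number of bricks per step (illustrative)

def metadata_to_vector(path):
    # Flatten the per-brick metadata into one fixed-length feature vector
    with open(path) as f:
        data = json.load(f)

    features = []
    for brick in data.values():
        features.append(brick["bricktype"])
        features.extend(brick["position"])
        features.extend(brick["orientation"])

    # 7 values per brick (type + position + orientation), zero-padded
    flat = np.asarray(features, dtype=np.float32)[:MAX_BRICKS * 7]
    vec = np.zeros(MAX_BRICKS * 7, dtype=np.float32)
    vec[:len(flat)] = flat
    return vec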

Requirements and caveats

Synthetic data is a helpful way to train a model if real-life data is unavailable or hard to obtain. For this project a large dataset of images from actual LEGO instructions, including 3D information, would be needed.
PDF instructions are generally available on LEGO's website; however, processing them to be usable for machine learning would be a whole project on its own and far too laborious.
So I'm creating synthetic data using SideFX Houdini. The data will be 3D-rendered images of LEGO assemblies as well as JSON files containing metadata with 3D information.

The generated data needs to consider the following quality aspects:

  • shading and rendering need to mimic actual LEGO instructions as closely as possible
  • data needs to be highly variant, meaning different brick-types, combinations, viewing angles etc.
  • data needs to represent many different assembly steps, e.g.:
    – one or multiple bricks being added at once
    – bricks being stacked on top or below the previous assembly
    – bricks being stacked in between bricks of the previous assembly
Unaltered LEGO instructions from PDF

As shown above, the model architecture contains pre-trained models to handle the image data. I haven't finally settled on this yet, but it will probably be YOLO or Mask R-CNN.
Generally, LEGO bricks will be detected in the images via either object detection or instance segmentation.

Object detection

Object detection means the model is able to detect the presence of individual objects in the image. The result is a 2D bounding box around each detected object. The following image shows ideal bounding boxes for the instructions seen in the previous paragraph:

Synthetic data for object detection: colors are converted to grayscale, bounding boxes and scene information are being provided

To provide labeled images for this kind of algorithm, the synthetic metadata of the 3D renderings needs bounding box information for every LEGO brick present in an image.
To achieve this, the polygonal coordinates of the brick-geometry are transformed into camera-space, discarding Z-depth. Taking minimum and maximum values from the resulting 2D coordinates, a bounding box can easily be derived.

Here’s a script in Houdini VEX to retrieve the 2D bounding box from a packed primitive (= a LEGO brick) in normalized coordinates, so x and y values range from 0 to 1. Multiplying by the image resolution will yield actual pixel-coordinates.
(Use this on a prim-wrangle in the packed-prim geo-stream)

// CAUTION: Currently only works for orthogonal oriented packedprims!

// Get bounding box of current packed prim
float bbox[] = primintrinsic(0,"bounds",i@primnum);
// Get camera
string cam = chs("camera");

// bbox is:
// [xmin, xmax, ymin, ymax, zmin, zmax]
vector bb_min = set(bbox[0],bbox[2],bbox[4]);
vector bb_max = set(bbox[1],bbox[3],bbox[5]);
// Infer bounding box size
vector bb_size = bb_max - bb_min;

// Create bounding box corner points and store in array
vector cp[];
append(cp, bb_min);
append(cp, bb_min + set(bb_size.x,0,0));
append(cp, bb_min + set(0,bb_size.y,0));
append(cp, bb_min + set(0,0,bb_size.z));
append(cp, bb_min + set(bb_size.x,bb_size.y,0));
append(cp, bb_min + set(bb_size.x,0,bb_size.z));
append(cp, bb_min + set(0,bb_size.y,bb_size.z));
append(cp, bb_max);

// Transform all corner points to NDC
vector cp_ndc[];
foreach(vector p; cp){
    append(cp_ndc, toNDC(cam, p));
}

// Get min/max values from corner points in NDC space
vector bb_ndc_min = min(cp_ndc);
vector bb_ndc_max = max(cp_ndc);

// Store bounding box coordinates in semi MS COCO convention,
// meaning bbox-origin is on top left, but coordinates are
// normalized -> need to multiply with image res for abs. pixel values!
f@bb_x = bb_ndc_min.x;
f@bb_y = 1 - bb_ndc_max.y;
f@bb_width = bb_ndc_max.x - bb_ndc_min.x;
f@bb_height = bb_ndc_max.y - bb_ndc_min.y;
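On the Python side, the exported normalized values can then be multiplied by the image resolution to obtain absolute pixel coordinates, e.g. in COCO-style [x, y, width, height]. Here's a minimal sketch of that conversion; the helper function and the 128×128 default resolution are assumptions for illustration.

def normalized_bbox_to_pixels(bb_x, bb_y, bb_width, bb_height, img_w=128, img_h=128):
    # Convert the normalized bbox attributes exported from Houdini into
    # absolute pixel coordinates (COCO-style [x, y, width, height])
    return [bb_x * img_w, bb_y * img_h, bb_width * img_w, bb_height * img_h]

# Example with illustrative values:
# normalized_bbox_to_pixels(0.25, 0.1, 0.3, 0.4)  ->  [32.0, 12.8, 38.4, 51.2]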

Should an object detection algorithm be used in the end, it will be pre-trained on the above-mentioned imagery and data before being integrated into the multi-input model.
There, the bounding-box and object information for both building steps will be processed by a multilayer perceptron before being concatenated with the 3D data.

Instance segmentation

Like object detection, instance segmentation also detects the presence of objects in an image and returns bounding boxes. On top of that, instance segmentation also yields a pixel-accurate mask for each object.

As it is better suited for occluded objects (which the bricks in a LEGO assembly certainly are) it might be a more promising, although more elaborate, algorithm for this project.

Synthetic data for Instance Segmentation: the flat color shapes are image masks of the individual bricks

For model training, the synthetic data also needs to provide labelled image masks. Since image-masks are typical work-items in 3D-animation, Houdini’s Mantra render engine provides the means to generate them automatically. For this I’m exporting an extra image plane with a custom brickID float-variable that’s exported from the LEGO brick’s shader.

Although one might intuitively choose an integer value, brickID is a float because the image is exported as a 32-bit EXR file. Despite the high bit depth, some noise due to compression and other possible reasons is still present in the numbers.
Hence brickID is multiplied by 10 before exporting, so ID values like 1, 2, 3 become 10, 20, 30. This way noise has a lesser effect on the mask values. After importing the mask in Python, the scale is reversed and a final rounding yields integer values again (the lowest ID value being 1, since the background is 0).

The image mask assigns a brick ID to each pixel (the mask is downscaled in this image to make the values readable)

The resulting mask can be inspected with OpenCV and Seaborn using a simple Python script:

import cv2
import matplotlib.pyplot as plt
import seaborn as sns

# Note: depending on the OpenCV build, EXR support may need to be enabled by
# setting os.environ['OPENCV_IO_ENABLE_OPENEXR'] = '1' before importing cv2.

# Path to image-mask
image_path = 'masktest_brickID.exr'
image_mask = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)

# Print largest ID-value for verification
print((image_mask/10).round().max())

# Flatten image to single-channel/grayscale
mask_prev = cv2.cvtColor(image_mask, cv2.COLOR_BGR2GRAY)

# Reverse mask-scaling from Houdini and resize image for preview
mask_prev = cv2.resize((mask_prev/10).round(),
                       (32, 32),
                       interpolation=cv2.INTER_NEAREST,
                       )

# Plot heatmap with ID-annotations
plt.figure(figsize=(8, 8))
ax = sns.heatmap(mask_prev,
                 annot=True,
                 cmap='tab20',
                 annot_kws={'size': 7},
                 )

# Ensure square aspect-ratio and hide axis-ticks/markings
ax.set_aspect('equal')
plt.axis('off')
plt.show()

Should instance segmentation be the algorithm of choice, it will be pre-trained on the previously mentioned data.
The resulting bounding boxes, image masks and object information for both building steps will then feed into a multilayer perceptron before being concatenated with the 3D data.
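As a side note, an ID mask like the one above can be split into one binary mask per brick, which is the per-instance format typically expected for segmentation training. Here's a minimal sketch of that step, assuming the mask has already been rescaled and rounded as described earlier; the helper name is illustrative.

import numpy as np

def split_instance_masks(id_mask):
    # Turn a single brick-ID mask into one binary mask per brick.
    # Assumes ID 0 is the background and IDs 1..n are individual bricks.
    masks = {}
    for brick_id in np.unique(id_mask):
        if brick_id == 0:
            continue  # skip background
        masks[int(brick_id)] = (id_mask == brick_id).astype(np.uint8)
    return masks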

Now that the general data formats are defined, various types of assemblies for training can be generated.
For this I've created a procedural assembly generator in Houdini. Using arbitrary base shapes, it generates a LEGO brick assembly that is similar in shape and volume.

Procedural assembly generator in SideFX Houdini

With the help of the assembly generator and the above-mentioned methods for bounding box and image mask generation, it will be possible to create an extensive and feature-rich dataset for model training and validation.

I’ll explain more about the details of procedural assembly generation and further project-progress in my next article.


Feel free to have a look at some of my other projects over on GitHub: https://github.com/irnscvngr


