Building a Lego Technic sorter with Real-Time Advanced Object Recognition

Aveek Goswami
Towards Data Science
9 min read · Nov 2, 2023


During my internship at Nullspace Robotics, I had the privilege of diving into a project that would enhance the company’s capabilities. We integrated object detection and machine learning image recognition to develop a machine that classifies Lego Technic pieces in real time.

In this blog post, I’ll guide you through the challenges encountered and how we successfully brought this project to fruition.

Amos Koh and I spent our summer ’22 teaching students programming and robotics while working on this project for Nullspace. You can find us at the links below the article.

Nullspace Robotics is Singapore’s leading provider of robotics and programming education for primary and secondary school students. A large part of their operations involves building robots with Lego Technic parts, which are sorted into specific trays. You can imagine it’s a nightmarish task asking an 8-year-old with boundless energy to help put the pieces back into the trays when all they want to do is build more things.

Nullspace tasked us with building a machine that sorts Lego Technic pieces into specific categories with minimal human intervention, solving one of the key efficiency challenges of running a robotics lesson.

Defining the Challenge

The project involved three main parts: real-time object and motion detection, image recognition, and building the hardware of the machine. Due to the time constraints of the internship, we focused primarily on the first two, which make up the software side of the project.

A key challenge was to recognise moving parts and identify them within the same frame. We contemplated two approaches: integrating machine learning image recognition into the object detection camera or keeping the processes separate.

Ultimately, we decided on separating object detection and recognition. This approach involved first capturing a suitable picture after detecting the object and then running a model to classify the image. Integrating the two processes would have required running the model on practically every frame to classify every object detected. Separating them eliminated the need for the model to be in a constant processing mode, ensuring a smoother and more computationally efficient operation.

Object Detection

We used ideas from the projects cited below the article to implement our object/motion detection program and customise it for Lego pieces.

In our case, we used similar motion detection concepts because our machine would involve a conveyor belt system of uniform colour, so any motion detected would be due to a Lego piece moving on the belt.

We applied Gaussian blurring and other image processing techniques to every frame and compared each frame with a running average of the previous frames. Further processing isolated the items causing motion and drew bounding boxes around them, as shown below:

for f in camera.capture_continuous(rawCapture, format="bgr", use_video_port=True):

    frame = f.array  # grab the raw NumPy array representing the image
    text = "No piece"  # initialize the occupied/unoccupied text

    # resize the frame, convert it to grayscale, and blur it
    frame = imutils.resize(frame, width=500)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)

    # if the average frame is None, initialize it
    if avg is None:
        print("[INFO] starting background model...")
        avg = gray.copy().astype("float")
        rawCapture.truncate(0)
        continue

    # accumulate the weighted average between the current frame and
    # previous frames, then compute the difference between the current
    # frame and running average
    cv2.accumulateWeighted(gray, avg, 0.5)
    frameDelta = cv2.absdiff(gray, cv2.convertScaleAbs(avg))

    # threshold the delta image, dilate the thresholded image to fill
    # in holes, then find contours on thresholded image
    thresh = cv2.threshold(frameDelta, conf["delta_thresh"], 255,
                           cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL,
                            cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)

    # loop over the contours
    for c in cnts:
        # if the contour is too small, ignore it
        if cv2.contourArea(c) < conf["min_area"]:
            continue

        # compute the bounding box for the contour, draw it on the frame,
        # and update the text
        (x, y, w, h) = cv2.boundingRect(c)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        piece_image = frame[y:y+h, x:x+w]
        text = "Piece found"
        # cv2.imshow("Image", image)
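To make the arithmetic behind `cv2.accumulateWeighted` and the thresholding step concrete, here is a minimal pure-Python sketch on a single row of pixel intensities. The function names and sample values are illustrative assumptions and not part of our program; OpenCV applies the same update, `avg = (1 - alpha) * avg + alpha * frame`, to every pixel of the image.

```python
def update_running_average(avg, frame, alpha=0.5):
    """Blend the new frame into the background model, pixel by pixel."""
    return [(1.0 - alpha) * a + alpha * p for a, p in zip(avg, frame)]


def motion_mask(avg, frame, delta_thresh):
    """Return 1 where the frame differs from the background by more than delta_thresh."""
    return [1 if abs(p - a) > delta_thresh else 0 for a, p in zip(avg, frame)]


# Toy data: a uniform belt (intensity 10) and a bright piece covering two pixels.
background = [10.0, 10.0, 10.0, 10.0]
frame = [10.0, 200.0, 210.0, 10.0]

# As in the loop above, the average is updated first and the difference taken second.
avg = update_running_average(background, frame, alpha=0.5)
mask = motion_mask(avg, frame, delta_thresh=5)

print(avg)   # [10.0, 105.0, 110.0, 10.0]
print(mask)  # [0, 1, 1, 0]
```

Because the average is updated before the difference is taken, the measured difference on the first frame a piece appears is scaled by `(1 - alpha)`. With an alpha of 0.5 it is halved, which is worth keeping in mind when choosing the delta threshold.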

To ensure the motion was actually caused by a Lego piece, we assessed the stability of the detection with a motion counter: motion had to be detected for a certain number of consecutive frames before we concluded that it was due to a Lego piece and not miscellaneous noise. The final image was then saved and fed into our CNN for classification.

if text == "Piece found":
    # to save images of bounding boxes
    motionCounter += 1
    print("motionCounter= ", motionCounter)
    print("image_number= ", image_number)

    # Save image if motion is detected for 8 or more successive frames
    if motionCounter >= 8:
        image_number += 1
        image_name = str(image_number) + "image.jpg"
        cv2.imwrite(os.path.join(path, image_name), piece_image)
        motionCounter = 0  # reset the motion counter

        # classify the saved image with our model, see below
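The counter logic can also be isolated from the camera loop, which makes it easier to test and tune. The following is a minimal sketch of that idea; the class name `MotionDebouncer` and its interface are hypothetical, and the threshold of 3 in the example is chosen only to keep the trace short.

```python
class MotionDebouncer:
    """Signal a capture only after motion persists for `required_frames` frames in a row."""

    def __init__(self, required_frames=8):
        self.required_frames = required_frames
        self.count = 0

    def update(self, piece_found):
        """Feed one frame's detection result; return True when a capture should happen."""
        if not piece_found:
            self.count = 0  # streak broken, start looking for a new object
            return False
        self.count += 1
        if self.count >= self.required_frames:
            self.count = 0  # reset so the next capture needs a fresh streak
            return True
        return False


debouncer = MotionDebouncer(required_frames=3)
detections = [True, True, False, True, True, True, True]
captures = [debouncer.update(d) for d in detections]
print(captures)  # [False, False, False, False, False, True, False]
```

The single noisy gap in the third frame resets the streak, so the capture only fires once three consecutive detections have been seen.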

Creating the Model

Building the dataset

We created the dataset of images ourselves rather than using images of Lego Technic pieces found online because we wanted to replicate the conditions under which the model would eventually be detecting and classifying the pieces. We hence designed a simple conveyor belt system using none other than Lego Technic pieces themselves! We then hooked it up to a Lego Spike Prime motor to keep the conveyor belt moving.

Designing the model architecture

To address the heart of the challenge, I adapted a machine learning model I found on Aladdinpersson’s GitHub repository. This model featured convolutional layers whose filter counts step down from 128 to 64 to 32 to 16, an architectural choice designed to improve image recognition.

Instead of using a pre-trained model, we designed our own convolutional neural network because:

  1. We did not require particularly deep feature extraction for our images
  2. We wanted to keep the model small and reduce its complexity, at the same time reducing the computational cost of running the model. This would enable us to run the CNN more efficiently as a TFLite model on the Pi.

Data normalization was a crucial step to ensure consistent training accuracy, especially given the variation in the range of values captured by different images due to lighting differences.

In this model, ReLU and softmax activations together with dense and flatten layers played pivotal roles. ReLU activation, for example, helps mitigate the vanishing gradient problem during training. Dense layers are the standard fully connected layers in TensorFlow models. Softmax activation was used to calculate probabilities for each category in our dataset.
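As a small illustration of what the softmax output layer computes, here is a pure-Python sketch. The raw scores are made-up numbers for seven classes, not outputs of our model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for 7 classes.
scores = [2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3]
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

The ordering of the scores is preserved, so the class with the largest raw score is also the one with the largest probability.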

For the loss function, we used Keras’ Sparse Categorical Cross Entropy, a fitting choice for multi-class classification with integer labels. The Keras Adam optimizer, known for its efficiency, was used to train the model.
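To show what this loss measures, here is a minimal pure-Python sketch for a single example. "Sparse" means the true label is given as an integer class index rather than a one-hot vector, and the loss is the negative log of the probability the model assigned to that class. The probability vector below is invented for illustration.

```python
import math


def sparse_categorical_crossentropy(true_class, probs):
    """Negative log-probability assigned to the true class (an integer index)."""
    return -math.log(probs[true_class])


# Hypothetical softmax output over 7 classes.
probs = [0.70, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03]

loss_if_correct = sparse_categorical_crossentropy(0, probs)  # true class is the top guess
loss_if_wrong = sparse_categorical_crossentropy(2, probs)    # true class got only 5%

print(round(loss_if_correct, 4))  # 0.3567
print(round(loss_if_wrong, 4))    # 2.9957
```

A confident correct prediction gives a small loss, while assigning little probability to the true class is penalised heavily.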

Training and Optimization

The number of epochs was chosen to strike a balance between undertraining and overfitting, with a preference for a number below 200 to ensure optimal model performance. To speed up training, we used Google Colab, which provided access to GPU resources and trained significantly faster than our own laptops.

The full model architecture is shown below:

data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal",
                      input_shape=(img_height,
                                   img_width,
                                   1)),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.1),
])

model = keras.Sequential(
    [
        data_augmentation,

        layers.Rescaling(1./255, input_shape = (img_height,img_width,1)), #normalize the data input

        layers.Conv2D(128, 3, padding="same", activation='relu'),
        layers.MaxPooling2D(pool_size=(2,2)),

        layers.Conv2D(64, 3, padding="same", activation='relu'), #should this be 16 or 32 units? try with more data
        layers.MaxPooling2D(pool_size=(2,2)),

        layers.Conv2D(32, 3, padding="same", activation='relu'),
        layers.MaxPooling2D(pool_size=(2,2)),

        layers.Conv2D(16, 3, padding="same", activation='relu'),
        layers.MaxPooling2D(pool_size=(2,2)),

        layers.Dropout(0.1),
        layers.Flatten(),
        layers.Dense(10,activation = 'relu'),
        layers.Dense(7,activation='softmax'), # number of output classes

    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=[keras.losses.SparseCategoricalCrossentropy(from_logits=False),],
    metrics=["accuracy"],
)

model_history = model.fit(x_train, y_train, epochs=200, verbose=2, validation_data=(x_test,y_test), batch_size=25) #i think 25/32 is the best batch size

Choosing the architecture

In most common CNN architectures, the number of filters increases to try and capture more complex features at higher layers. However, there was a high degree of similarity among our classes of Lego pieces, and we needed our network to look for specific features such as bends and holes. I felt that a smaller number of filters in deeper layers might help in focusing on these few subtle differences, rather than looking at multiple features which may not help in discriminating the pieces.

We tested both architectures, with decreasing filters and increasing filters, and the decreasing filter model performed significantly better. It hence seems that by reducing the number of filters, we can get the network to focus on what is essential, reducing noise from complex feature maps.

Of course, it depends on your use case and the distinguishing features present in your dataset. Something like facial recognition for example would need a more complex feature map, so the increasing filter approach may work better.
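One way to see what the two layouts trade off is to count trainable parameters. The sketch below is a rough calculation under stated assumptions, not a measurement of the models we trained: it assumes 128x128 grayscale inputs (the size used at inference later in this post), 3x3 kernels with "same" padding, four 2x2 max-pooling steps, and that the increasing variant is simply the mirror image, 16 to 32 to 64 to 128. The helper names are hypothetical.

```python
def conv2d_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter for a standard 2D convolution."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units


def count_params(filter_sequence, img_size=128, in_channels=1, hidden=10, classes=7):
    """Return (total trainable parameters, flattened feature size)."""
    total = 0
    channels = in_channels
    size = img_size
    for filters in filter_sequence:
        total += conv2d_params(channels, filters)
        channels = filters
        size //= 2  # each 2x2 max-pool halves height and width
    flat = size * size * channels
    total += dense_params(flat, hidden) + dense_params(hidden, classes)
    return total, flat


decreasing = count_params([128, 64, 32, 16])
increasing = count_params([16, 32, 64, 128])
print(decreasing)  # (108487, 1024)
print(increasing)  # (179159, 8192)
```

Under these assumptions the two convolutional stacks hold almost the same number of weights. Most of the difference comes from the first dense layer, which sees 1,024 flattened features in the decreasing layout versus 8,192 in the increasing one.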

Model results

The model was trained with 6,000 images spanning 7 categories of Lego Technic blocks. It achieved a final validation accuracy of 93%. Diagrams showing the progression of training as well as a confusion matrix to assess performance are shown below:
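For readers who have not built one before, a confusion matrix is simply a tally of true class against predicted class. The sketch below shows the bookkeeping in pure Python on a tiny toy example with three classes; the labels are invented for illustration and are not our results.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for 3 classes, not taken from our dataset.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
for row in cm:
    print(row)
print(round(accuracy(cm), 3))
```

Off-diagonal entries show which pairs of classes get mixed up, which is more informative than a single accuracy figure when several pieces look alike.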

Implementing the model on the Raspberry Pi

The most efficient way to run a neural network on a Pi is as a TFLite (TensorFlow Lite) model. We saved the model locally and then loaded it onto the Pi.
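The conversion step itself is not shown in this post. A minimal sketch of how a trained Keras model could be exported, assuming the standard `tf.lite.TFLiteConverter` API and a hypothetical helper name `export_to_tflite`, might look like this. The default output path mirrors the one loaded below.

```python
from pathlib import Path


def export_to_tflite(model, out_path="lego_tflite_model/detect.tflite"):
    """Convert a trained Keras model to a .tflite file and return its path."""
    # Imported lazily so this helper can be defined on machines without TensorFlow.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_bytes = converter.convert()

    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out.write_bytes(tflite_bytes)
    return out
```

This would be run on the training machine (for example in Colab) with the trained `model`, and the resulting file copied to the Pi.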

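from tflite_runtime.interpreter import Interpreter

# Load TFLite model and allocate tensors.
interpreter = Interpreter(model_path="lego_tflite_model/detect.tflite") # insert path to the tflite model
interpreter.allocate_tensors()

# Look up the input and output tensor details used when running inference below
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()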

Continuing from the motion counter block above, the suitable images were then fed into the neural network to be classified:

# continuing from if text == "Piece found":
        # Open the image, convert it to grayscale, resize it and increase its contrast
        input_image = Image.open('lego-pieces/'+ image_name)
        input_image = ImageOps.grayscale(input_image)
        input_image = input_image.resize((128,128))
        input_data = img_to_array(input_image)
        input_data = increase_contrast_more(input_data)
        input_data = input_data.reshape(1, 128, 128, 1)  # add the batch dimension

        # Pass the np.array of the image through the tflite model. This will output a probability vector
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])

        # Get the index of the highest value in the probability vector.
        # This index value will correspond to the labels vector created above (i.e. index value 1 will mean the object is most likely labels[1])
        category_number = np.argmax(output_data[0])

        # Return the classification label of the image
        classification_label = labels[category_number]
        print("Image Label for " + image_name + " is :", classification_label)

else:
    motionCounter = 0  # reset motion counter to look for new objects

Flexibility was a key consideration. The motion counter threshold could be adjusted either for capturing images to build the dataset or for deciding when an image should be captured for classification, enhancing the system’s versatility.

Demonstration

The culmination of our efforts was a showcase of the system’s overall accuracy, supported by photos and videos capturing its operation. The conveyor belt setup (above) was an essential part of this demonstration:

Future Work and Areas for improvement

Software: A future extension would be to include a quality-checker model in the pipeline to ensure that the images used to classify the pieces are suitable.

Hardware: The model would undoubtedly benefit from a superior camera for higher quality images. Moreover, the conveyor belt system built temporarily for our testing and demonstration will need to be scaled up to accommodate more pieces. A method will also need to be devised and implemented to separate multiple Lego pieces and ensure only one piece is visible in the camera’s frame at a time. There are similar projects available online which go into detail on possible methods.

Conclusion

My journey at Nullspace Robotics was my first foray into building my own neural network for practical purposes. Having designed models as part of training courses in the past, I found it a wholly different experience creating one intended for actual production, where we need to account for factors such as resources and intended use, and figure out how to tailor our dataset and model to fit our purposes. I look forward to continuing my journey in machine learning and leveraging the latest AI technologies to build more innovative solutions.

I would like to thank Nullspace for the opportunity to work on this project and am excited to see what’s next for the company as it pushes the boundaries of robotics education.


Imperial College Computational Bioengineering student and deep learning engineer. I write about machine learning and software product development.