I recently competed in a local AI competition where the challenge involved human pose classification with 15 different classes. This was my first ever AI competition and the experience was humbling. My team came in 2nd runner-up overall based on our model accuracy, creativity and teamwork. I would definitely recommend that learners attend competitions or hackathons, as they serve as a great networking platform and a training ground to hone your technical skills. Here I will walk you through our model building process, which ultimately helped us achieve a high 83% accuracy on the final test set.
The challenge given to us was to develop an image classification algorithm that could differentiate between 15 different human poses.
The poses were:
ChairPose, ChildPose, Dabbing, HandGun, HandShake, HulkSmash, KoreanHeart, KungfuCrane, KungfuSalute, Salute, WarriorPose, EaglePose, ChestBump, HighKneel, and Spiderman
Under the competition's terms and conditions, we do not have the rights to release the dataset. Hence, all images shown below were taken by us during the competition.

Each class within the training dataset contains approximately 100 images, while the validation set has 25 images per class. This is a relatively small amount of data for a 15-class classification task, which explains the need for us to increase the dataset by taking our own images.
OpenPose is a human pose estimation model that we used as a feature extraction step to detect the human within each image. The model identifies keypoints of the individual body parts, and a human skeleton can be drawn by connecting these keypoints. Extracting the human pose from each image served as a preprocessing step to reduce noise in our data.
With heavy reference from this site, we generated keypoints for each image using the pre-trained MPII model.
cwd = os.getcwd()
# Specify the paths for the 2 files
protoFile = "{}/pose_deploy_linevec_faster_4_stages.prototxt".format(cwd)
weightsFile = "{}/pose_iter_160000.caffemodel".format(cwd)
nPoints = 15
POSE_PAIRS = [[0,1], [1,2], [2,3], [3,4], [1,5], [5,6], [6,7], [1,14], [14,8], [8,9], [9,10], [14,11], [11,12], [12,13] ]
# Read the network into Memory
net = cv2.dnn.readNetFromCaffe(protoFile, weightsFile)
frameWidth = 640
frameHeight = 480
threshold = 0.1
# Forward training set into OpenPose model to generate output
inWidth = 299
inHeight = 299
m = train_images.shape[0]
train_outputs = np.zeros((1,44,38,38))
for i in range(m):
    inpBlob = cv2.dnn.blobFromImage(train_images[i], 1.0/255, (inWidth, inHeight), (0, 0, 0), swapRB=True, crop=False)
    net.setInput(inpBlob)
    output = net.forward()
    train_outputs = np.vstack((train_outputs, output))
# Remove the placeholder first row used to initialise the array
train_outputs = np.delete(train_outputs, 0, axis=0)
H = train_outputs.shape[2]
W = train_outputs.shape[3]
print(train_outputs.shape)
# Generate keypoints for the training set
m = 973
H = 38
W = 38
train_points = np.zeros((m, 15, 2))
for sample in range(m):
    for i in range(nPoints):
        # Confidence map of the corresponding body part
        probMap = train_outputs[sample, i, :, :]
        # Find the global maximum of the probMap
        minVal, prob, minLoc, point = cv2.minMaxLoc(probMap)
        # Scale the point to fit on the original image
        x = (frameWidth * point[0]) / W
        y = (frameHeight * point[1]) / H
        if prob > threshold:
            train_points[sample, i, 0] = int(x)
            train_points[sample, i, 1] = int(y)
Next, using the keypoints stored in the variable train_points, a human skeleton was drawn on every image.
# Processed images with the skeleton drawn on the original image
train_processed = np.copy(train_images).astype(np.uint8)
for sample in range(m):
    for point in range(nPoints):
        if train_points[sample, point, 0] != 0 and train_points[sample, point, 1] != 0:
            cv2.circle(train_processed[sample], (int(train_points[sample, point, 0]), int(train_points[sample, point, 1])), 10, (255, 255, 0), thickness=-1, lineType=cv2.FILLED)
    # Draw lines between connected keypoints
    for pair in POSE_PAIRS:
        partA = pair[0]
        partB = pair[1]
        if train_points[sample, partA, 0] != 0 and train_points[sample, partA, 1] != 0 and train_points[sample, partB, 0] != 0 and train_points[sample, partB, 1] != 0:
            cv2.line(train_processed[sample], (int(train_points[sample, partA, 0]), int(train_points[sample, partA, 1])),
                     (int(train_points[sample, partB, 0]), int(train_points[sample, partB, 1])), (255, 255, 0), 3)
# Processed images with the skeleton drawn on a black background
train_processed_grey = np.zeros((m, train_images.shape[1], train_images.shape[2], 1)).astype(np.uint8)
for sample in range(m):
    for point in range(nPoints):
        if train_points[sample, point, 0] != 0 and train_points[sample, point, 1] != 0:
            cv2.circle(train_processed_grey[sample], (int(train_points[sample, point, 0]), int(train_points[sample, point, 1])), 10, (1), thickness=50, lineType=cv2.FILLED)
    # Draw lines between connected keypoints
    for pair in POSE_PAIRS:
        partA = pair[0]
        partB = pair[1]
        if train_points[sample, partA, 0] != 0 and train_points[sample, partA, 1] != 0 and train_points[sample, partB, 0] != 0 and train_points[sample, partB, 1] != 0:
            cv2.line(train_processed_grey[sample], (int(train_points[sample, partA, 0]), int(train_points[sample, partA, 1])),
                     (int(train_points[sample, partB, 0]), int(train_points[sample, partB, 1])), (1), 3)
With this step, we obtained 3 different datasets:
- Original image
- Original image + skeleton overlay
- Skeleton with a blank background
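To feed these datasets into the fastai pipeline later, each one needs to live on disk in a train/valid folder structure with one sub-folder per class. The snippet below is a minimal sketch of how that could be done; the save_dataset helper, the folder names and the train_labels array of class names are assumptions for illustration, not part of our competition code.

import os
import cv2

def save_dataset(images, labels, base_dir, split):
    # Write each image into base_dir/split/<class_name>/<index>.jpg
    for idx, (img, label) in enumerate(zip(images, labels)):
        class_dir = os.path.join(base_dir, split, str(label))
        os.makedirs(class_dir, exist_ok=True)
        cv2.imwrite(os.path.join(class_dir, "{}.jpg".format(idx)), img)

# Hypothetical usage: one folder tree per dataset variant
save_dataset(train_images, train_labels, "data/original", "train")
save_dataset(train_processed, train_labels, "data/overlay", "train")
save_dataset(train_processed_grey, train_labels, "data/skeleton", "train")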

Transfer learning allows us to train deep neural networks with significantly less data than would normally be needed to train an algorithm from scratch. It also often leads to a more accurate model, as knowledge learned from a large dataset is transferred and fine-tuned to fit our data. For this competition, my team decided to test the performance of three pre-trained models, namely ResNet-50, ResNeXt-101 and PNASnet-5. All models were built using the fastai library.
import fastai
from fastai.metrics import error_rate
from torchvision.models import *
import pretrainedmodels
from fastai.callbacks.tracker import SaveModelCallback
from fastai.vision import *
from fastai.vision.models import *
from fastai.vision.learner import model_meta
bs = 8
# Importing of dataset
data = ImageDataBunch.from_folder(base_dir, train='train', valid='val', ds_tfms=get_transforms(), size=299, bs=bs).normalize(imagenet_stats)
#ResNet-50
models.resnet50
#ResNeXt-101
def resnext101_64x4d(pretrained=True):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.resnext101_64x4d(pretrained=pretrained)
    all_layers = list(model.children())
    return nn.Sequential(*all_layers[0], *all_layers[1:])
#PNASnet-5
def identity(x): return x
def pnasnet5large(pretrained=True):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.pnasnet5large(pretrained=pretrained, num_classes=1000)
    model.logits = identity
    return nn.Sequential(model)
# Training of model (model is one of models.resnet50, resnext101_64x4d or pnasnet5large)
learn = cnn_learner(data, model, metrics=accuracy)
learn.fit(20, callbacks=[SaveModelCallback(learn, monitor='accuracy', every="improvement", name="top_acc")])
learn.lr_find()
learn.recorder.plot()
learn.unfreeze()
learn.fit_one_cycle(5,callbacks=[SaveModelCallback(learn,monitor='accuracy',every="improvement",name="top_acc_1")])
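For completeness, below is a minimal sketch of how test-set predictions could then be generated with the fastai learner. The test folder name and the way predicted indices are mapped back to class names are assumptions for illustration, not taken from our competition code.

# Load the best checkpoint saved by SaveModelCallback
learn.load("top_acc_1")

# Attach the unlabelled test images (assumed layout: base_dir/test) and run inference
test_items = ImageList.from_folder(Path(base_dir)/'test')
learn.data.add_test(test_items)
preds, _ = learn.get_preds(ds_type=DatasetType.Test)

# Map predicted indices back to the 15 class names
pred_classes = [learn.data.classes[int(idx)] for idx in preds.argmax(dim=1)]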
We built 9 models in total: each of the three pre-trained architectures was trained on each of the three datasets we generated. Below are the results:

For the final test set, we predicted the classes of the unknown images using the PNASnet-5 model trained on the original dataset and attained an accuracy of 83%, winning us 3rd place in the competition.
And there you have it, human pose classification using the PNASnet-5 pre-trained model. It was disappointing that the OpenPose feature extraction did not improve model accuracy, but I believe we did well given the time constraints of the competition. For those who would like to know why PNASnet-5 performed so much better than the other pre-trained models, below is a summary of the algorithm by its authors.
