How Computers See: Image Recognition and Medieval Pole Arms

Simon Carryer
Towards Data Science
15 min read · Jul 7, 2019


Death, they say, is the great equaliser. It comes for all of us eventually, and none of us escape its icy grasp. This was as true throughout history as it is today, but on a medieval battlefield I imagine there was particular poignancy to the aphorism. Death was ever-present, and could arrive in a multitude of guises — on the end of a sharp piece of steel, beneath the hooves of charging horses, or from one of the innumerable diseases you can get from drinking water that someone else just shit into.

Yet medieval noblemen, in the tradition of rich assholes since the beginning of time, were ceaseless in their attempts to tip life’s scales in their favour. And by the end of the 15th century, that aristocracy came up with something pretty effective. Advances in metalworking and an increase in the availability of high-quality steel saw the introduction of increasingly elaborate, and effective, suits of plate armour. Enormously heavy, and appallingly expensive, these were largely employed by mounted knights — landowners wealthy enough to afford both a horse and armour, and the upkeep of both. This armour made one virtually immune to the slash of a sword or the thrust of a spear, and the lucky wearer could indulge in an afternoon of peasant-slaying with relative impunity.

What a jerk

Less pleased by this state of affairs were, of course, the peasants. Rather unsportingly, from the noble point of view, they developed a set of tools for dealing with this new battlefield threat. These tools all relied on one revolutionary piece of technology, which almost completely nullified the advantage of plate armour. This key innovation is something you would recognise today — it's what we call a "very long stick". Put some sharp steel at the end of a six or eight foot long wooden pole, and you had the perfect tool for spoiling a nobleman's fun. These weapons were called "pole arms".

As to what was best to adorn the Very Long Stick, there were four main schools of thought:

The spike: A spike is an excellent thing to have on the end of a long stick. The spear is perhaps the preeminent example of this, and has been in use since the dawn of humanity. The back end of the pole can be set into the ground, using a charging knight’s momentum against him, turning him into a well-armoured shish-kebab.

The blade: A sword or axe is fairly ineffective against plate armour. But put that same blade at the end of a long pole, and the leverage suddenly multiplies the force you can apply many times over. A blow that would once have been a gentle tap becomes an almighty whang, cleaving even the strongest armour.

The hammer: Applying a huge amount of force very abruptly to someone’s head will ruin their day, regardless of whether the blow penetrates their helmet or not. A hammer focuses all the power of a blow into a very small area, transferring that impact through the armour to the wearer, and greatly upsetting their plans.

The hook: Derived from a variety of existing farm implements, a hook could find chinks in armour that a straight blade might miss. These weapons were also used to pull mounted opponents from their seats, or drag footmen away from their friends for a more convenient murder.

In practice, pole arms often combined several of these approaches into a single, multi-use implement — an axe blade with a hook on the back, a hammer with a long spike out the top, or a long heavy blade that could both pierce and slash. Survival of the fittest in the constant warfare of the late medieval period saw a great flourishing of different shapes, all intended to make horrible things happen to someone as far away as a long pole could manage. So many of these weapons were created, in fact, that keeping track of them all is quite a challenge.

Determining whether a particular curlicue of steel is a glaive, a fauchard, a brandistock or maybe a bohemian earspoon would stretch anyone’s brain power. Lucky for us, in these modern times, we can call upon the power of artificial brains. Yes, this is yet another situation crying out for the application of machine learning.

What we want is a tool that can identify which category a pole arm belongs to, simply from an observation of its shape. We want to turn an image of a pole arm into a prediction about its proper name. This is a classification problem, like the one we encountered in the very first essay in this series, where we classified dinosaurs as either herbivores or carnivores, depending on certain physical characteristics.

But there is one very important difference. In the case of our dinosaurs, our data was very conveniently encoded as a set of relevant “features” about each creature — its weight, length, whether it had feathers, and so on. For our pole arms, our data is simply a set of images — the shape of the pole arm encoded in the colour value of each pixel.

Let’s think about what this means. Here’s how our data about dinosaurs looked:
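Something like this. The rows below are invented stand-ins rather than the real dataset, but the shape is right: one column per physical feature, plus a labelled diet for each dinosaur.

```python
# Illustrative rows only; the values here are invented, not the real data.
import pandas as pd

dinosaurs = pd.DataFrame({
    "weight_tonnes":      [0.5, 8.0, 30.0],
    "length_metres":      [2.0, 12.0, 22.0],
    "has_feathers":       [True, False, False],
    "walks_on_four_legs": [False, False, True],
    "diet":               ["carnivore", "carnivore", "herbivore"],  # the label
})
print(dinosaurs)
```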

With this data, the algorithm could determine a relationship between a feature and the diet of the dinosaur: walking on four legs makes a dinosaur very likely to be a herbivore; weighing less makes it more likely to be a carnivore, and so on. By combining these factors, the algorithm creates a set of rules that very accurately puts dinosaurs into the correct group.

But our pole arm data is a set of images — instead of defined features and categories, all we have is rows and rows of pixels. If we represent this data as rows and columns, it looks like this:

Top row: White, greyish, black, very black, grey again, white, off-white, white…

Second row: Greyish black, blackish grey, black, black, more black, grey, light grey, white…

No clear relationships exist between features — the content of a particular pixel — and classes — the kind of pole arm pictured. Does the fact that the pixel third from the left in the top row is black, rather than white, bear any clear relationship to whether the image is a bardiche or a bec-de-corbin? No, it does not. To a traditional classification algorithm this information is completely meaningless. Finding these relationships requires a different, and more sophisticated, approach. We're going to delve into a realm of machine learning we've not yet covered in these essays: "Neural Networks", or so-called "Deep Learning".
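To make this concrete, here's a minimal sketch in Python of what the algorithm actually receives. The numbers are toy values, not real image data:

```python
# Two rows of greyscale pixels: 0 is black, 255 is white.
# A real 40 x 40 image would be 1,600 of these numbers.
import numpy as np

image = np.array([
    [255, 170,  30,   5, 140, 250, 235, 255],  # white, greyish, black, very black...
    [ 60,  80,  10,   5,  15, 130, 200, 255],  # greyish black, blackish grey...
])

# To a traditional classifier, every pixel is just another "feature",
# and no single pixel tells us anything useful on its own.
print(image[0, 2])  # 30: darkish, and entirely uninformative by itself
```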

What neural networks do, which separates them from other machine learning algorithms, is learn not just how to turn meaningful features into a prediction about a row of data, but how to turn complex, unstructured information into meaningful features. Neural networks take a multi-stage, or “layered” approach to classification. Our pole arm identifier first prepares the data — converting the raw values of the pixels into abstracted information about the image. It is only in the final stage that it translates that information into a likelihood of the pole arm belonging to each class — perhaps it is 80% likely to be a halberd, 15% a bec-de-corbin, and 5% a glaive-guisarme. The highest-likelihood class becomes our final prediction.
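In code, the layered idea looks something like the sketch below. This is a minimal stand-in rather than the exact model from my repository, assuming Keras, 40 x 40 greyscale images, and our seven weapon classes:

```python
# A minimal sketch of the layered approach, not the exact model I used.
import numpy as np
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # bardiche, bec de corbin, fauchard, glaive, ...

model = models.Sequential([
    layers.Flatten(input_shape=(40, 40, 1)),          # raw pixel values in
    layers.Dense(128, activation="relu"),             # pixels -> abstract features
    layers.Dense(NUM_CLASSES, activation="softmax"),  # features -> probabilities
])

# For one image the network returns a probability per class, e.g.
# [0.80, 0.15, 0.05, ...]; the highest one becomes the prediction.
probabilities = model.predict(np.zeros((1, 40, 40, 1)))
prediction = int(np.argmax(probabilities[0]))
```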

That final step in the process is exactly the same as the simplest classifier algorithms. It is the earlier "prepare the data" step which really sets neural networks apart. How do they do that? The answer, for most neural networks, is a process called "back-propagation". It's called this because it involves the final layer, which predicts the class, "feeding back" information to the first layer. The first layer is made up of a set of very simple operators, which are called "neurons". They get this name because the way they operate mimics, at a basic level, the way the brain's neurons function. But don't be fooled by the sophisticated-sounding name — their operation is incredibly simple. The neurons look at the pixels in the image, and based on those pixels' values, they pass on a single signal of their own.
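Stripped to its essentials, a single neuron is just a weighted sum and a cutoff. A sketch of the idea, not any particular library's internals:

```python
# One "neuron": a weighted sum of its inputs, passed through a cutoff.
import numpy as np

def neuron(pixels, weights, bias):
    signal = np.dot(pixels, weights) + bias  # look at the pixel values...
    return max(0.0, signal)                  # ...and pass on a single signal

pixels = np.array([0.9, 0.1, 0.0, 0.8])    # brightness values, scaled 0 to 1
weights = np.array([0.5, -0.3, 0.1, 0.7])  # the values training will adjust
print(neuron(pixels, weights, bias=-0.2))  # 0.78
```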

When the neural network is first created, the neurons choose their thresholds completely at random — they make a guess. The network makes a prediction for a whole set of training data, and then it checks how well it did. The neurons which provided useful information get to keep their values, but those that led the network astray have their values adjusted. Over very many iterations, the neurons are slowly trained to discern what information to keep, and what to discard. By the end of the training process, they have learned what features to pass on to the final layer to maximise its chances of predicting the right class.
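That guess-check-adjust rhythm, shrunk to a toy example. Real back-propagation uses calculus to work out each adjustment, and this single-neuron version is only a stand-in, but the loop is the same:

```python
# Toy training loop: guess, check, adjust, repeat.
import numpy as np

np.random.seed(0)
X = np.random.rand(50, 4)           # 50 tiny "images" of 4 pixels each
y = (X[:, 0] > 0.5).astype(float)   # an invented rule for the true class

weights = np.random.randn(4)        # the first guess: pure randomness

for step in range(500):
    guesses = 1 / (1 + np.exp(-X @ weights))  # a prediction for every image
    errors = guesses - y                      # how wrong was each one?
    weights -= 0.5 * (X.T @ errors) / len(y)  # adjust the misleading weights

print(weights)  # by now, heavily weighted towards the one informative pixel
```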

To achieve this for our pole arms, we'll need a very large set of images of pole arms, each labelled with the correct name. We also need a wide variety of different pole arms of each class. We want to make sure that the network is learning the general principles of what distinguishes, say, a fauchard from a bardiche, and not just learning to recognise details of the particular images we chose.

Gathering this dataset turned out to be a huge amount of work, not just because I had to painstakingly search, crop, and filter dozens of images by hand, but because it turns out that there's no consensus on what any of these things are called. The average medieval peasant, it seems, was more concerned with staying alive than with the correct nomenclature for the weapon that helped them do so. A "fauchard", then, is confidently defined by one source as a cousin of the glaive, with the addition of a rear-facing spike or hook. Another, equally authoritative source claims the same weapon is a modified scythe — a forward-curving blade on the end of a pole.

I needed an authoritative source, and as I have done many times before, I turned to the Advanced Dungeons and Dragons 2nd Edition Player’s Handbook (Revised). Here is how that august tome defines the seven classes of weapon my model will classify:

Bardiche: One of the simplest of pole arms, the bardiche is an elongated battle axe. A large curving axe-head is mounted on the end of a shaft 5 to 8 feet long.

Bec de corbin: An early can-opener designed specifically to deal with plate armor. The pick or beak is made to punch through plate, while the hammer side can be used to give a stiff blow. The end is fitted with a short blade for dealing with unarmored or helpless foes.

Fauchard: An outgrowth of the sickle and scythe, the fauchard is a long, inward curving blade mounted on a shaft six to eight feet long.

Glaive: One of the most basic pole arms, the glaive is a single-edged blade mounted on an eight- to ten-foot-long shaft.

Guisarme: Thought to have derived from a pruning hook, this is an elaborately curved heavy blade.

Glaive-guisarme: Another combination weapon, this one takes the basic glaive and adds a spike or hook to the back of the blade.

Halberd: Fixed on a shaft five to eight feet long is a large axe blade, angled for maximum impact. The end of the blade tapers to a long spear point or awl pike. On the back is a hook for attacking armor or dismounting riders.

Those definitions will infuriate many military historians, I am sure, but for me they will suffice.

As well as collecting dozens of examples of each weapon type, I expanded my dataset in another way, by "synthesising" extra images. That meant taking my existing images, flipping and stretching them, shifting them left and right, and speckling them with random noise, so that each image I collected could appear in my dataset many times over. By using stretched, manipulated, and especially flipped images, we help the algorithm focus on general shapes and relationships in the images, rather than specific details. Finally, the images are desaturated (all colour is removed) and shrunk to only 40 pixels on each side. This reduces the amount of data the algorithm has to consider, and dramatically increases the speed at which it can learn.
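In Keras, the synthesis step can be set up along these lines. This is a representative configuration rather than my exact settings, and the filename is invented:

```python
# A sketch of the image synthesis: flips, shifts, stretches, and speckle.
import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def speckle(image):
    return image + np.random.normal(0, 10, image.shape)  # random noise

augmenter = ImageDataGenerator(
    horizontal_flip=True,            # the same weapon, mirrored
    width_shift_range=0.1,           # shift left and right
    height_shift_range=0.1,
    zoom_range=0.2,                  # stretch and squash
    preprocessing_function=speckle,
)

# Desaturate and shrink each source image before synthesising variants
# ("halberd_01.png" is a made-up filename):
img = Image.open("halberd_01.png").convert("L").resize((40, 40))
pixels = np.asarray(img, dtype=float).reshape(1, 40, 40, 1)

variant = next(augmenter.flow(pixels, batch_size=1))  # one synthesised image
```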

Like many things in machine learning, neural networks take a bit of tinkering to work right. They have a range of parameters and settings, governing several esoteric aspects of their operation. A network can have more than two layers, for example, boiling down the raw data into an ever-richer soup of meaningful features. I will spare you the details of this.

Choosing these values is still a somewhat arcane practice, more akin to alchemy than chemistry. But the testing process is exactly the same as for any other classification algorithm. Before we train the model, we set aside a proportion of the images, hiding them from our model. These hold-outs don’t need to go through the process of stretching, speckling, and flipping. Instead, we use them to check the accuracy of our neural network. After learning what it can from a set of training images, can it correctly predict the classes of a set of images it has never seen before? For each image, we ask the network to guess its class. It returns a list of probabilities — how likely it is for each image that it belongs to a given class.
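The hold-out check itself is only a few lines. A sketch, reusing the model from the earlier snippet, with random stand-in arrays in place of the real images:

```python
# Sketch of the hold-out test; X and y are stand-ins for the real data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 40, 40, 1)     # 200 fake 40 x 40 greyscale images
y = np.random.randint(0, 7, size=200)  # 200 fake class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50)           # learn from the training set
loss, accuracy = model.evaluate(X_test, y_test)  # then face the unseen images
probabilities = model.predict(X_test)            # per-class likelihoods
```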

Predicted classes for each of 25 test images — the image is shown next to the predicted probabilities of it belonging to each class. The correct class is in blue, and the label is red if the class was incorrectly predicted. In most cases, the model predicts the correct class with close to 100% probability.

Our algorithm is extremely accurate! It makes a correct prediction for all but one of our forty or so test images, the sole mistake being a bec-de-corbin misidentified as a halberd. With simple classes such as these, and such clear and small images, the recognition task is very straightforward for our algorithm.

With such good accuracy, it’s difficult to trust that the network has really learnt to identify pole arms, and is not just memorising some trivial facet of the images we’ve supplied to it. Is there a way we can dig into the internal workings of the algorithm, to better understand how it arrives at its predictions?

You’ll recall from a previous essay, about extracting meaning from text, that we could use a series of mathematical operations to turn a collection of words describing a movie into a set of numerical values which contained some representation of the “meaning” of that movie. With our neural network we’ve done something very similar — taken an image of a pole arm, and turned that into numerical representations of some meaningful information about that image. It’s just a series of numbers, but it contains some information about the shapes in the image. It turns out that these numerical representations of our pole arms have some very interesting properties, and they can help us understand more about how the network makes its classifications, and what exactly it has learned.
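One common way to get at those numbers is to read the activations of the layer just before the final classification step. A sketch, again reusing the model and test images from the earlier snippets:

```python
# Read the network's internal representation of each image: the
# activations of the layer just before the final softmax.
from tensorflow.keras.models import Model

embedding_model = Model(
    inputs=model.input,
    outputs=model.layers[-2].output,  # the 128-number "features" layer
)
embeddings = embedding_model.predict(X_test)  # one vector per image
```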

Just like with our movies, we can calculate similarity between our pole arm representations (or “embeddings”). We can measure the difference between the numerical values, and use that to find the most or least similar images. That reveals some interesting stuff. If we simply measure the difference between the pixels in the image, we find images which are superficially similar, but which can represent quite different pole arms. By contrast, finding similar embeddings can find images that are quite different, but which represent a similar design of weapon.
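Measuring the "difference" is ordinary geometry: treat each representation as a point, and find its nearest neighbours. With stand-in values:

```python
# Nearest neighbour by Euclidean distance, in pixel space and in
# embedding space. The arrays are random stand-ins for the real data.
import numpy as np

def most_similar(query, others):
    return int(np.argmin(np.linalg.norm(others - query, axis=1)))

pixels = np.random.rand(25, 1600)     # 25 images, flattened to pixel rows
embeddings = np.random.rand(25, 128)  # the network's summary of each one

print(most_similar(pixels[0], pixels[1:]))          # superficially similar
print(most_similar(embeddings[0], embeddings[1:]))  # similar in "meaning"
```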

In the above image, we've taken a typical guisarme, and found the most similar images based on image similarity, i.e. how many pixels they have in common, and based on embedding similarity, i.e. the difference between their numerical representations. Based on image similarity, the closest match is not a guisarme at all, but rather a fauchard, which happens to occupy a similar space in the frame. But the embedding similarity has found another guisarme. Interestingly, it's found this guisarme despite the fact that it's facing the opposite direction to the original. That demonstrates something called reflection invariance. Because we trained our model on flipped, stretched, and speckled transformations of our source images, it learnt to ignore those factors — it has learnt that a guisarme is a guisarme regardless of whether it faces left or right.

Another thing we can do with these embeddings of our images is to calculate averages. For example, we can take all the embeddings for our glaives, and take the average of them. That gives us a new embedding which represents the "glaiviest" possible glaive — the "ur-glaive". But we can't really see what that glaive looks like — it's just a string of numbers. What we can do is find the glaives that are closest to that ideal. We can take the glaives from our test set and order them from most to least "glaivy". I've done that in the image below, with the most similar on the left, and the least on the right. The left-hand glaive is extremely simple, and has no unusual features. By contrast, the glaive on the right has all kinds of strange features — hooks and spikes. It could almost pass as a voulge or bardiche.
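The arithmetic behind the ur-glaive is nothing fancier than an average and a sort:

```python
# Average the glaive embeddings, then rank every glaive by its
# distance to that average. Random stand-ins for the real embeddings.
import numpy as np

glaive_embeddings = np.random.rand(10, 128)
ur_glaive = glaive_embeddings.mean(axis=0)  # the "glaiviest" possible glaive

distances = np.linalg.norm(glaive_embeddings - ur_glaive, axis=1)
order = np.argsort(distances)  # most glaivy first, least glaivy last
```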

You’ll notice though that the order of the images doesn’t quite match how you or I might sort them. The fourth glaive, for example, looks to me very similar to the first, though the embeddings are apparently not so similar. This is an important reminder — the network works in mysterious ways. Though its results might sometimes coincide with our expectations, it can also sometimes confound those expectations. It is not a human, and we should not expect human-like conclusions from it.

We can do other maths on these embeddings. We can add them together. If we take the “glaiviest” glaive, and the “guisarmiest” guisarme, and add their embeddings together, we can then find the image that best represents a combination of glaive and guisarme. Happily, the result of this operation is a glaive-guisarme — a glaive that has been embellished with a guisarme-like hook.
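In code, that combination is plain vector addition followed by one more nearest-neighbour search:

```python
# Embedding arithmetic: glaive + guisarme, then find the closest image.
# The vectors are random stand-ins for the real averaged embeddings.
import numpy as np

ur_glaive = np.random.rand(128)
ur_guisarme = np.random.rand(128)
all_embeddings = np.random.rand(40, 128)

combined = ur_glaive + ur_guisarme
closest = int(np.argmin(np.linalg.norm(all_embeddings - combined, axis=1)))
# With the real data, the winning image is a glaive-guisarme.
```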

What have we learnt? We’ve learnt to respect the inventiveness of medieval peasants, and we’ve also learnt a little about neural networks. By use of a “hidden layer”, neural networks are able to extract meaningful data from very complex sources — simple images in this case, but also sound, film, and — as we’ll see in a future essay — text. They can use that meaningful information to make (often very accurate) classifications of new data. Facial recognition software, more sophisticated recommendation algorithms, document classification, and many other systems all leverage neural networks in this way. But the embeddings that neural networks generate are also very useful, and power a whole host of other applications: Google’s image search uses — in part — information extracted from images by a neural network. Chatbots employ neural networks to encode the meaning of questions and answers.

It is tempting to ascribe human-like qualities to the intelligence of neural networks. After all, they are in some ways imitations of human brain structures. It would not be entirely surprising if they were able to mimic human-like behaviours. But it's important to remember that these systems have a very narrow field of knowledge — they're trained to do one thing only — and they don't care how they do it. Our pole arm recogniser is very good at its job, and our experiments with the embeddings it generates show that it to some extent recognises some of the same features in the images that we do. But we also saw that some of its results were quite surprising. It made choices a human would not. It is just as likely to pay attention to minute details in the images as to the larger structures; there's nothing in its programming telling it what features it's "supposed" to recognise in the images, so it only cares about what works. That's a trait that's common to all artificial intelligences, and it's a quality that makes them both fascinating and sometimes frightening. By performing human-like tasks in a sometimes very unusual way, they can feel like an insight into a truly alien way of thinking — if they can be said to "think" at all.

Thanks for reading! The previous essay in this series “Ten Thousand First Dates: Reinforcement Learning Romance” is available here. All the code for this essay is available from my github, here. The next essay in this series — on text generation — is available here.
