Knitting and Recommendations

How Computers Think: Part Three

Simon Carryer
Towards Data Science


I don’t, as it turns out, know nearly as much as I need to about knitting. Looking back, it seems absurdly naive of me to assume that I could pick up in a couple of hours of browsing the internet all the knowledge I’d need to make meaningful statements about such a complex field. It is a dangerous kind of hubris to believe that a good dataset and a smart algorithm can entirely substitute for expert knowledge, yet it is all too common. It is born of a combination of bullish confidence in the power of mathematics to unravel complex problems, and a kind of chauvinism about the relative complexities of “hard” engineering and mathematical problems compared to “soft” social and interpersonal problems. It results in a lot of ill-conceived start-ups, and a lot of well-intentioned attempts to address intractable social issues with a phone app. Happily, my own foray into blithe ignorance had substantially lower stakes.

My plan for this essay was to demonstrate two kinds of recommendation algorithm. Recommendations are maybe the most common, and definitely the most visible, manifestations of artificial intelligence on today’s internet. Large online retailers, trying to squeeze every possible dollar out of browsers, will try to tempt you to an extra purchase by showing you ostensibly relevant items. Online content providers, trying to keep you on their site, prompt you with another article or video, on what they guess is a similar topic. Music sites have made maybe the deepest investments in providing good recommendations. Understanding the complexities of musical taste, when the factors that make one song great and another terrible are so nebulous and intangible, is a problem they are all competing to solve.

Broadly speaking, recommendation algorithms are about calculating similarity. They extrapolate from one or more items that a user likes, to find a set of similar items that the user will also like. In terms of how they calculate this similarity, most of these recommendation systems fall into two broad categories. The first is perhaps the most obvious. “Content-based” recommendations assume that two items are similar if they have similar qualities. To make a content-based recommendation, you need to know a lot about the things you’re recommending. That’s easy if you’re recommending things with clear, comparable and quantifiable properties. Comparing one refrigerator to another, for example, is straightforward: Size, wattage, freezer compartment or no, ice-cube maker, etc. Comparing a refrigerator to a blender is more difficult. Content-based recommendations also often miss more subtle context. If you have recently purchased a refrigerator, the algorithm is liable to assume that you must really like refrigerators, and therefore that you want to start a collection of them.

The second kind of algorithm, called “collaborative filtering”, is more subtle. It assumes that two things are similar if the same people like them. In other words, if you and I both like refrigerators, and we both like blenders, then that means that, to some extent, refrigerators and blenders must have something in common. Obviously, if you look at a sample of just two people, this approach is subject to considerable error. That you and I happen to share an enthusiasm for both country music and nature documentaries does not imply a strong connection between those two things. But with a big enough group of people, it becomes remarkably reliable. By aggregating the opinions of large numbers of users, spurious connections tend to be overwhelmed by the larger number of more meaningful connections — in general, people shopping for refrigerators are more likely to also look at blenders than they are to look at, say, an arc-welder.

These algorithms are often much better than content-based approaches at identifying subtle cultural markers that are difficult to label. For this reason they are often used for recommending things like music or television shows. The difference between, for example, The Beatles and The Monkees might be difficult for a content-based algorithm to parse. They’re both four-piece pop groups from the 1960s. They have substantially the same instrumentation, a similar upbeat sound, and they both made an unfortunate late-career foray into psychedelia. An algorithm looking solely at these facts might assume that the two groups are very similar. But a fan of The Beatles is unlikely to be pleased to hear The Monkees pop up on their playlist. A collaborative filtering algorithm is able to detect these differences — and might instead suggest something that sounds quite different from The Beatles, but has a similar fanbase.

Collaborative filtering algorithms have a couple of weaknesses. Firstly, because they rely on user behaviour, they’re vulnerable to over-interpreting unusual outliers. A small number of users with unusual habits can sometimes skew the data in unpredictable ways. Secondly, and related to the first point, they need a lot of interactions with a new item before they can make a sensible suggestion. An item that has only been seen by a handful of users will get suggestions based on a small dataset, and those suggestions will reflect the idiosyncrasies of that small set of users. This is called the “cold start problem”, because the algorithm needs a lot of data to get “warmed up” before it can make good recommendations.

With those two algorithms in mind, let’s get back to my embarrassing failure. To demonstrate these two approaches to recommendations, I needed a large dataset of items, information about some properties of those items (to form the basis of the content-based algorithm), and information about users and their interactions with those items (to build the collaborative filtering algorithm). After some time sifting about the internet and talking to some friends, I settled on Ravelry, a website for sharing knitting patterns, as the source of my data. Ravelry is probably the largest collection of knitting and crochet patterns on the internet, and its vast numbers of users upload, comment on, share, download, and save hundreds of thousands of patterns, all over the world, all year round. They have a huge collection of exactly the kind of data I needed. One very audacious email later, and the extremely helpful people at Ravelry were setting me up with the means to download as much of this data as I needed.

Thanks, Ravelry!

Now. This is where I should have paused. Taken stock. Thought about the project I was embarking on, made an honest assessment of my knowledge of the subject matter, and recruited some expert assistance. Reader, I did not. The consequences of this would not become apparent until much later.

To help users find patterns on their site, Ravelry records a great deal of “metadata”. This includes things like the “difficulty rating” of the pattern, keywords about the style, whether the pattern is knitting or crochet, whether it’s a jumper or a pair of mittens, as well as a whole host of more obscure metrics. For our purposes, this would form the basis of the “content-based” algorithm. We’re going to build an algorithm that can, given one knitting pattern, find another that is similar, based on the metadata of those patterns. Ignoring the more esoteric information available on the site, I downloaded the most important metrics for several hundred patterns, and went to work preparing the data.

Here’s how the major features of a pattern — in this case for an adorable knitted soft toy — appear on Ravelry’s site:

Mr. Dangly, in a contemplative mood

Name: “Mr. Dangly”

Category: animal

Difficulty: 1.9

Craft: knitting

Keywords: fringe, seamed, written-pattern, worked-flat

To turn this blob of unstructured information about a knitting pattern into a dataset suitable for our algorithm requires a series of mathematical tricks and transformations. To start, with the exception of “Difficulty”, all of this data is in the form of English words, which are as meaningful to a computer as a string of binary ones and zeroes is to you and me. Our first task is to turn these words into numbers. To do that we use a trick called “one-hot” encoding.

One-hot encoding gets its name from a technique in electronics engineering. Imagine you have a machine that can be in one of two states: on, or off. You can easily indicate the state of the machine with just one light. If the light is on (or “hot”), the machine is on. If the light is off, so is the machine. But what if the machine has three states? Say, on, off, and “warming up”. Now it’s more complicated to show the machine’s state. You need a second light. No lights means the machine is off, the first light means the machine is warming up, and the second light means the machine is ready. If both lights are on, something’s wrong with your machine. For every state your machine can be in, you need to add an extra light. This is not the most efficient way to encode information — you need a lot of lights if the machine is very complex — but its advantage is its reliability. It’s hard to accidentally signal the wrong state, and it’s easy to tell if something’s gone wrong: there will be more than one light on at a time.

We can use the same technique to encode our category data. Instead of a machine that can be in various states, we have a single column of data, each row of which indicates one of several categories. Instead of lights, we translate our single column of data into multiple columns, one for each of the categories, ‘shawl-wrap’, ‘pullover’, ‘beanie’, ‘cardigan’, ‘scarf’, etc. For each row, all of these columns have a value of zero, except for the one that corresponds to the category of our item, which is set to one. We can determine the category of each item in the dataset by looking at which column has a value of one.

The keyword data — things like “fringe”, “granny-squares” and so on — receives essentially the same treatment: one column for each possible keyword, with each row containing a zero if that item does not include that keyword, and a one if it does. This creates a very “wide” table, with a lot of columns. Our stuffed friend, Mr. Dangly from above, would be represented (in part) by a row with a one in the “animal” category column, a one in the “knitting” craft column, a one for each of his keywords, and zeroes everywhere else, alongside his numeric difficulty score.
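Here is a minimal sketch, in pandas, of how that encoding might look. The two patterns alongside Mr. Dangly, and the exact column names, are invented for illustration; the real dataset had several hundred patterns and many more columns.

```python
import pandas as pd

# A tiny illustrative dataset: Mr. Dangly plus two invented patterns.
patterns = pd.DataFrame({
    "name": ["Mr. Dangly", "Easy Jumper", "Tricky Gloves"],
    "difficulty": [1.9, 2.0, 3.1],
    "category": ["animal", "pullover", "gloves"],
    "craft": ["knitting", "knitting", "knitting"],
    "keywords": [
        ["fringe", "seamed", "written-pattern", "worked-flat"],
        ["seamed", "written-pattern"],
        ["written-pattern", "worked-flat"],
    ],
})

# One-hot encode the single-valued columns: one new 0/1 column
# per category and per craft.
encoded = pd.get_dummies(patterns[["category", "craft"]])

# Keywords are multi-valued, so each keyword gets its own 0/1 column too.
keyword_columns = patterns["keywords"].str.join("|").str.get_dummies()

# Stitch the numeric difficulty back on alongside the encoded columns.
features = pd.concat(
    [patterns[["name", "difficulty"]], encoded, keyword_columns], axis=1
)
print(features)
```

Every word in the original data has become a column of ones and zeroes — exactly the kind of thing a computer can do arithmetic with.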

Now that we’ve encoded this data, we can use it to make content-based recommendations.

Recommendations of this sort are essentially about calculating similarity. We want to find the most similar pattern to Mr. Dangly. To do that, we’re going to use a measure called “Euclidean distance”.

Euclidean distance measures distance just as we’d measure distance in real life — the length of a line between two points. Just as we can calculate the distance between two sets of latitude and longitude coordinates on a map, we can calculate the distance between two rows of numeric data in a table. This is easy to visualise if we imagine our dataset has only two columns, “Difficulty” (2 for Mr. Dangly), and “Animal” (1 for Mr. Dangly — he’s a monkey). We can plot this on a two-dimensional chart, and compare it to three other hypothetical patterns: “Mr. Tricky” — a stuffed animal, like Mr. Dangly, but more complex to make, with a difficulty rating of 3; “Easy Jumper” — easy to make, like Mr. Dangly, but not a stuffed animal; and “Tricky Gloves” — not like Mr. Dangly at all, neither easy to make, nor a stuffed animal.

Our data for these hypothetical patterns looks like this:

Pattern          Difficulty   Animal
Mr. Dangly       2            1
Mr. Tricky       3            1
Easy Jumper      2            0
Tricky Gloves    3            0

The distance calculations become clear when we draw them on a two-dimensional chart:

Figure 1: Euclidean distances

The Euclidean distance between Mr. Dangly and these three other patterns is literally the length of the line between them, just as you’d measure it with a ruler. The “Tricky Gloves”, by virtue of having a longer, diagonal line to Mr. Dangly, has the largest Euclidean distance, and is the least similar. “Easy Jumper” and “Mr. Tricky” are equally distant from Mr. Dangly — they’re equally similar.
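As a quick sanity check, here is a minimal sketch in NumPy that computes those distances from the hypothetical data above.

```python
import numpy as np

# Each hypothetical pattern as a point in (difficulty, animal) space.
points = {
    "Mr. Dangly":    np.array([2.0, 1.0]),
    "Mr. Tricky":    np.array([3.0, 1.0]),
    "Easy Jumper":   np.array([2.0, 0.0]),
    "Tricky Gloves": np.array([3.0, 0.0]),
}

# The Euclidean distance is the length of the straight line between two
# points: the norm of the difference between their coordinate vectors.
for name, point in points.items():
    if name == "Mr. Dangly":
        continue
    distance = np.linalg.norm(points["Mr. Dangly"] - point)
    print(f"{name}: {distance:.2f}")

# Mr. Tricky: 1.00, Easy Jumper: 1.00, Tricky Gloves: 1.41
```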

The fascinating — and confusing — thing about Euclidean distance is that the calculation holds for any number of extra dimensions. You can imagine adding a third, height axis to this chart, representing maybe one of the keywords. The line in three-dimensional space would be longer for those patterns that did not share the keyword. It gets harder (for me, at least) to visualise this in more than three dimensions, but I am assured that the maths is entirely sound, and practice bears this out.

This chart also exposes one of the problems with this approach. You’ll notice that the Euclidean distances from Mr. Dangly to “Mr. Tricky” and to “Easy Jumper” are exactly equal. Our algorithm assumes that both of these patterns are equally similar to Mr. Dangly, that they are equally good recommendations. Is this a good assumption? Is a difference of one point of difficulty equivalent to the difference between a stuffed animal and a jumper? Intuitively, it feels like no. But how can I know? If these two differences are not of the same magnitude, how different are they? How much should I weight a difference in difficulty compared to a difference in category? I am completely unqualified to answer. This is the point at which I (far, far too late) elected to enlist some outside help, so I could understand more about how knitters choose patterns. The result was… humbling.

It turns out that all of the obscure metrics I had previously discarded, like “yarn weight”, “needle gauge” and “yardage”, were in fact vitally important to many knitters. I’ve been a fool! Knitters will often look for patterns that work with the yarn they already have, rather than purchasing new yarn for each pattern. That means that their choice of patterns is often constrained by the kind of yarn required to make the pattern. Leaving these metrics out of my algorithm meant that the recommendations it generated were often superficially acceptable, but didn’t reflect knitters’ actual preferences at all.

To fix these issues, first I had to go back to the source of the data and, cursing myself, download it all again, including the extra metrics I now realised were vital to the analysis. Then, in consultation with my hastily-recruited knitting experts, I created weightings — a value to scale each metric by, reflecting how much it matters to knitters when they choose patterns. With these improvements, I could make recommendations that started to make sense to actual knitters.

The final algorithm chiefly considered the craft of the pattern (whether it was knitting, crochet, or something else), the category, the yarn weight, and the difficulty. But it also took into account keywords, the needle gauge, and a host of other factors. For each of these, it would calculate a “distance” between patterns. The more factors they had in common, the smaller the distance.
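A minimal sketch of that weighting idea is below. The weights, feature names, and feature values are all invented for illustration; the real weights came out of conversations with my knitting experts, and the real features were the encoded columns described earlier.

```python
import numpy as np

# Invented weights: how much a difference in each feature should count.
weights = {
    "craft_knitting": 5.0,   # knitting vs. crochet matters a great deal
    "category_animal": 3.0,  # a soft toy and a jumper are very different projects
    "yarn_weight": 3.0,      # knitters often pick patterns to suit yarn they already own
    "difficulty": 1.0,
    "needle_gauge": 0.5,
}

def weighted_distance(pattern_a, pattern_b, weights):
    """Euclidean distance where each feature difference is scaled by its weight."""
    diffs = [
        weights[feature] * (pattern_a[feature] - pattern_b[feature])
        for feature in weights
    ]
    return float(np.sqrt(np.sum(np.square(diffs))))

# Two invented patterns, described by already-encoded numeric features.
mr_dangly = {"craft_knitting": 1, "category_animal": 1, "yarn_weight": 2,
             "difficulty": 1.9, "needle_gauge": 3.5}
another_toy = {"craft_knitting": 1, "category_animal": 1, "yarn_weight": 2,
               "difficulty": 2.1, "needle_gauge": 3.5}

print(weighted_distance(mr_dangly, another_toy, weights))  # a small distance: very similar
```

Scaling each feature before taking the distance means that a mismatch in craft pushes two patterns much further apart than a small difference in difficulty ever could.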

The “Spring Collection”

For Mr. Dangly, the knitted soft-toy monkey from above, this algorithm recommended the “Spring Collection”, a set of soft toy patterns including a hedgehog, a frog, a bunny, and a lamb. They were similarly easy to knit, and employed similar techniques — a seamed construction, and use of fringing (for Mr. Dangly a dapper tuft of hair, and for the Spring Collection a hedgehog’s spines). In many respects, this is a very good recommendation. But there was something not quite right. Mr. Dangly has a certain eccentricity about him, a degree of whimsy or style that makes him stand out. By contrast, while superficially similar to Mr. Dangly, the Spring Collection are missing his unique charm. They seem a little bland, and slightly twee. I feel like friends of Mr. Dangly are unlikely to get on with the members of the Spring Collection. The algorithm has made a suggestion which is correct on paper, but which misses some of the nuance of the patterns. It misses something that no human could fail to see. It misses their personality.

That left the collaborative algorithm to try to do better. Collaborative filtering algorithms are very different to their content-based cousins. Where our content-based algorithm uses metadata about the patterns themselves, a collaborative filtering approach looks at information about the people who interacted with those patterns. Ravelry allows users to “favourite” patterns, and store a list of their favourites. These lists are public, which meant I was able to download lists of the “favourite” patterns of several hundred thousand Ravelry users. Individually, these lists are idiosyncratic and not very informative. Someone who likes a chunky knitted scarf might also have an interest in novelty tea cosies, and this shouldn’t imply anything to us about either scarves or chintzy home decor. But in aggregate, in the kinds of volumes I was able to retrieve from Ravelry, this information can become very useful indeed. Because collaborative filtering uses the comparatively more complex and subtle information about human behaviours and preferences, my hope was that I’d be able to more accurately capture something about the elusive Mr. Dangly.

In many ways this algorithm was much simpler to construct, although the volume of data was much larger. Where the content-based algorithm used a dataset in which each column represented some aspect of the pattern, for the collaborative algorithm I built a dataset in which each column represented a user, and the value in each row reflected whether that user had “favourited” that pattern.

With this dataset, we can calculate similarity in exactly the same way as we did with the content-based algorithm. Each user is a “dimension” in the Euclidean distance calculation. The patterns that are liked by the same users are considered similar.
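Here is a minimal sketch of that structure. The users and their favourites are invented for illustration; the real matrix had hundreds of thousands of users as its columns.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Rows are patterns, columns are users; a 1 means that user favourited the pattern.
favourites = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]],
    index=["Mr. Dangly", "Chunky Scarf", "Tea Cosy"],
    columns=["user_a", "user_b", "user_c", "user_d"],
)

# Every user is one dimension; patterns favourited by the same users end up
# with a small Euclidean distance between them.
distances = pd.DataFrame(
    squareform(pdist(favourites.values, metric="euclidean")),
    index=favourites.index,
    columns=favourites.index,
)
print(distances)
```

It is exactly the same distance calculation as before; the only thing that has changed is what the columns mean.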

The first results of this algorithm were, to my eye, not promising. A tiny knitted purse was, according to the algorithm, very similar to a lace-edged washcloth. A pair of cabled socks was similar to an intricate shawl. It made no sense, until I spoke to my knitting experts, who pointed out more subtle similarities that my untrained eye had missed. The purse and the washcloth were both small projects that used a variety of techniques, patterns a new knitter might use to develop their skills. The socks and the shawl both used two colours of fine yarn, which might be preferred by someone looking to use up their existing stock of wool. But other recommendations were baffling even to my expert knitters. Patterns that had very few interactions — new patterns or obscure ones — tended to get idiosyncratic or plain wrong recommendations, the product of the algorithm having very little information on which to base its similarity calculation. This is a weakness of the collaborative algorithm — the “cold start problem” — rearing its head. There are some mathematical tricks you can use to get around this problem, and I ended up using a lot of these (I won’t go into them here in any detail), but the algorithm continued to give poor recommendations for any item without a good number of user interactions.

The Socktopus

But how did it fare with our friend Mr. Dangly? Fortunately, Mr. Dangly’s peculiar charms had garnered him sufficient attention to create a good set of recommendations. And the leading recommendation was an ideal illustration of the strengths of a collaborative recommendation. “Socktopus” is, as the name implies, an octopus who wears socks. He’s easy to make, and uses some similar techniques to Mr. Dangly, but is also missing some of Mr. Dangly’s features — there’s no seam, and no fringing. But crucially, the Socktopus has something of the same air of whimsy, the same sense of personality, as Mr. Dangly. The collaborative algorithm has found something that the content-based algorithm hasn’t — not just the raw information, but how people relate to that information. It understands something about their meaning.

For this reason, collaborative algorithms are hugely popular with anyone trying to recommend complex cultural products, whether that is books, movies, music, or even knitting patterns. But when online platforms make extensive use of collaborative recommendations to guide what content they show to users, they risk falling victim to another pernicious weakness of this kind of algorithm. Collaborative algorithms are very good at finding subtle connections between items, at defining the boundaries of taste. They can distinguish between sub-genres of music, so that Death Metal fans are never offended by hearing Doom Metal (which, I am assured, is a totally different thing). They can finely parse the topics of news articles, so that someone who follows their local sports team is not bothered with local politics. But this policing of boundaries comes at a cost. They can create a “bubble” effect.

A bubble effect is when users are only shown a small slice of a much larger pool of content, and never become aware of anything outside their tiny portion. When an algorithm is driven by what users interact with, and users only see what the algorithm shows them, there is a feedback loop. Users interact with what they are shown, and they are shown what they (or users like them) interact with. Without care, users can find themselves wrapped in a cocoon of “safe” content, only ever seeing the same sorts of things. This is a risk for online platforms commercially. If users never discover new things they might like, and if they never have an opportunity to expand their tastes, they get bored and move on. But this also has a more subtle social risk. In a feedback loop, people are only shown content that suits their taste, that they agree with, content that fits their existing worldview. It is unpleasant to encounter things which challenge what you believe about the world; to hear music you don’t like, watch films about things you don’t understand, or read news articles about terrible events. But encountering and attempting to understand new and challenging things is how we learn to accommodate and cooperate with people different from ourselves. By walling ourselves away with like-minded people, seeing only things we like, we wall ourselves off from the wider world.

It’s easy to make assumptions about something when you don’t know a lot about it. I underestimated the complexity and nuance of knitting and its enthusiasts. I’d never really encountered it before, and it was very uncomfortable to have that ignorance exposed. Both collaborative and content-based recommendation algorithms can only extrapolate patterns from what they’ve seen in the past. They, like me, are apt to make assumptions based on what they already know. This is not a problem with the algorithms themselves — the maths is unimpeachable. The problem is with the people building the algorithms, and with the people using them. My downfall with the content-based algorithm was assuming that a lot of data and some clever maths were enough to understand a complex field. The risks with the collaborative algorithm are more subtle, but they’re also born of a kind of arrogance, a belief that we can completely anticipate a user’s needs, that those needs are as simple as finding something like what they already enjoy. In both cases, embracing complexity, and adopting a more holistic approach, can help mitigate these risks.

This is a theme we’ve seen with the previous algorithms, and we’ll see again in future essays. It’s a common feature of machine learning algorithms. They are only as clever as the data that was used to create them. If they are to truly learn, truly create something new, rather than just reflect our existing knowledge and biases, then it is us that must teach them. It is us that must confront our ignorance. It is us that must learn.

The previous article in this series, “Linear Regression and Lines of Succession”, is available here. Code for this article can be found on my GitHub, here. The next article will be published in April.

To make your own Mr. Dangly or Socktopus, go to www.ravelry.com (account required).
