Machine Learning From Scratch: Part 3

Arrays and representations

Sebastian Kwiatkowski
Towards Data Science



Part 3 introduces arrays. This family of higher-order collections allows us to describe images and text documents in a format that can be processed by machine learning algorithms.

Along the way, we will discuss sentiment analysis, an important application of natural language processing that is used for market research and reputation management.

Arrays

Last time, we introduced the idea of higher-order collections: collections that are organized in collections.

Arrays are the most important family of higher-order collections in machine learning. They are used to represent images, text documents and many other types of data.

Arrays have three important properties. I will first enumerate them and then go into more detail:

  1. Arrays are lists of one or more dimensions.
  2. All lists on a particular level have the same format.
  3. All of the elements in an array are numbers (an assumption we will maintain throughout this series).

Vectors, matrices and 3D arrays

A one-dimensional array is simply a list. An array of two dimensions is a list of lists. And a three-dimensional array is a list of lists of lists. We will not use any arrays with more than three dimensions in this article.

One-dimensional arrays are called vectors. Two-dimensional arrays are called matrices. We will not introduce a special term for three-dimensional arrays and simply refer to them as 3D arrays.
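
If you would like to experiment with these concepts, here is a minimal sketch in Python using the NumPy library (a common choice for array manipulation, though not the only one):

    import numpy as np

    vector = np.array([1, 2, 3])                    # a one-dimensional array
    matrix = np.array([[1, 2], [3, 4], [5, 6]])     # a two-dimensional array
    array_3d = np.array([[[1, 2], [3, 4]],
                         [[5, 6], [7, 8]]])         # a three-dimensional array

    # The ndim attribute reports the number of dimensions.
    print(vector.ndim, matrix.ndim, array_3d.ndim)  # 1 2 3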

Same level, same format

In an array, all lists on a given level have the same format.

Consider the following example: [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ]. This is a list of three lists. Each of the three inner lists has the same number of elements: 2. This collection, therefore, qualifies as an array.

In contrast, the collection [ [ 1, 2, 3 ], [ 4 ], [ 5, 6 ] ] does not qualify as an array, because the inner lists have different lengths: 3, 1 and 2, respectively.
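
The same distinction shows up in code. In the following minimal sketch, NumPy happily turns the first nested list into a 3 x 2 array, while recent versions raise an error for the ragged list (older versions fall back to an array of Python objects instead):

    import numpy as np

    valid = np.array([[1, 2], [3, 4], [5, 6]])
    print(valid.shape)   # (3, 2): three inner lists with two entries each

    try:
        np.array([[1, 2, 3], [4], [5, 6]])
    except ValueError as error:
        print("Not an array:", error)   # the inner lists differ in length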

Numbers only

The individual items in an array are known as its entries (the term I will use) or elements.

All entries are assumed to be numbers.

Image representation

We will first discuss how matrices are used to represent grayscale images. A later section will extend our discussion to color images. True to the name of the series, we will proceed one step at a time.

An image is a collection of pixels arranged on a grid.

Pixels are the smallest elements presented on a screen and the lowest level of analysis in computer vision. A pixel is characterized by the following attributes:

  1. a position on the grid
  2. one or more measurements related to the intensity of light

Representing a grayscale image

Consider the following example of the digit 3:

Fig. 1

To break this image down into pixels, we can impose a grid that divides up the image into cells:

Fig. 2

Each cell in this grid will be considered as one pixel.

Pixels differ in the intensity of light, which we can measure on a grayscale from 0 to 255. A value of 0 corresponds to black, and a value of 255 corresponds to white. The values in between are different shades of gray.

Using measurements on this scale, we can switch from the visual representation to a numerical representation:

Fig. 3
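
To make this concrete, here is a hypothetical 7 x 6 pixel grid for the digit 3, written as a NumPy array. The values are illustrative (the exact pixels in the figure may differ), with 0 for the black strokes of the digit and 255 for the white background:

    import numpy as np

    # A hypothetical 7 x 6 grayscale image of the digit 3.
    # 0 = black (the strokes of the digit), 255 = white background.
    image = np.array([
        [255, 255, 255, 255, 255, 255],
        [255,   0,   0,   0,   0, 255],
        [255, 255, 255, 255,   0, 255],
        [255, 255,   0,   0,   0, 255],
        [255, 255, 255, 255,   0, 255],
        [255,   0,   0,   0,   0, 255],
        [255, 255, 255, 255, 255, 255],
    ], dtype=np.uint8)

    print(image.shape)   # (7, 6): 7 rows of pixels, 6 columns of pixels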

Generalizing to other images

So far, we have looked at one specific image with a height of 7 pixels and a width of 6 pixels.

To generalize this representation to all images of this format, the specific values can be replaced with variables:

Fig. 4

We can denote the height of the image with the letter m and the width of the image with the letter n. In the previous example, we had m = 7 and n = 6.

Using this notation, we can now generalize from a 7 x 6 image to an m x n image.

Fig. 5

If we remove the gray borders and add square brackets around the grid, we arrive at a matrix representation.

Matrices as tables of numbers

Fig. 6

A matrix is an outer list of one or more inner lists.

Fig. 6 illustrates that the entries in a matrix form m rows and n columns. There are two ways to look at this: we can either think of the rows as the inner lists and consider the matrix as a list of rows, or we can focus on the columns and regard the matrix as a list of columns.

Viewed from the first perspective, the inner lists are stacked on top of each other. For example, in the matrix [ [ 1, 2 ], [ 3, 4 ] ], the list [ 1, 2 ] can be stacked on top of the list [ 3, 4 ] to form a table. Alternatively, we can consider the two lists as the columns of the table.

An individual entry in a matrix is denoted as a_ij. [I’m using the underscore “_” to indicate a subscript.]

The letter i stands for the index of the row and the letter j corresponds to the column. For example, the entry in row 3 and column 2 is denoted as a_32. In Fig. 3, we have a_32 = 255.
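
One caveat when moving from this notation to code: mathematical indices start at 1, whereas Python indices start at 0. A quick sketch:

    import numpy as np

    a = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])

    # The math notation a_32 (row 3, column 2) becomes a[2, 1] in code,
    # because Python counts rows and columns from 0 instead of 1.
    print(a[2, 1])   # 6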

3D arrays for RGB images

Color images are most often described through the RGB model in which the color of each pixel is expressed as a triple (r, g, b), with the three items corresponding to the intensity of red light, green light and blue light, respectively.

To numerically represent an RGB image, we use a 3D array. Whereas a two-dimensional array forms a table, a 3D array can be thought of as a cube:

Fig. 7
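
As a minimal sketch, an m x n RGB image can be stored as an array of shape (m, n, 3), with the last dimension holding the red, green and blue intensities of each pixel (the channel order is a convention; some libraries prefer blue-green-red):

    import numpy as np

    m, n = 7, 6
    image = np.zeros((m, n, 3), dtype=np.uint8)   # an all-black m x n color image

    # Color the pixel in row 1, column 1 (math notation) pure red:
    image[0, 0] = [255, 0, 0]                     # an (r, g, b) triple

    print(image.shape)    # (7, 6, 3)
    print(image[0, 0])    # [255   0   0]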

This covers the fundamentals of image representation for machine learning approaches to computer vision.

Next, we will discuss how vectors relate to matrices and why they are suitable for the representation of text documents.

Vectors as special cases of matrices

Recall the following two facts:

  • A vector is a list of numbers.
  • A matrix can be thought of as a table, arranged in rows and columns.

Combining these two facts, we can see a vector as a particular row or column of a matrix.

A row vector is a matrix with a single row (m = 1). A column vector is a matrix with a single column (n = 1).

Row vectors, column vectors and matrices can be enclosed in square brackets or parentheses. Commas are usually left out.

The matrix in Fig. 3 consists of 7 row vectors and 6 column vectors. We can, for example, extract the following row vector from the second row of this matrix: [ 255 0 0 0 0 255 ].

We denote an individual vector entry with x_i and the number of entries with d.
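
Here is a minimal sketch of extracting row and column vectors, using just the first two rows of our hypothetical image matrix from above:

    import numpy as np

    image = np.array([[255, 255, 255, 255, 255, 255],
                      [255,   0,   0,   0,   0, 255]])   # first two rows only

    row = image[1, :]   # the second row: [255 0 0 0 0 255]
    col = image[:, 0]   # the first column: [255 255]

    # Reshaping turns them into explicit 1 x n and m x 1 matrices.
    row_vector = row.reshape(1, -1)   # shape (1, 6): a row vector
    col_vector = col.reshape(-1, 1)   # shape (2, 1): a column vector

    d = row.shape[0]    # the number of entries: d = 6
    print(row, d)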

Vectors and matrices are arguably the two most important data structures in machine learning. Here is a quick comparison:

Fig. 8

Sentiment analysis

When you publish an opinion about a brand online, computational systems will likely analyze your statement, find out whether it expresses a positive or a negative view and use the result for market research and reputation management purposes. This application of natural language processing (NLP) is called sentiment analysis.

An important empirical finding is that simple approaches to sentiment analysis can achieve excellent performance and are hard to outperform.[1, 2] After all, many customers make it a point to be understood by anyone willing to read.

I will use sentiment analysis to explain how vectors can be used to represent documents.

Document representation

NLP at the movies

Movie reviews are a popular data source in sentiment analysis research.[3] They are easy to obtain in large numbers, pose interesting challenges that will be discussed in later articles and can help predict box-office revenues.[4]


While the approach outlined in this section can be effectively applied to hundreds of thousands of reviews of different lengths, we will use only a small collection of reviews that express clear opinions. The focus here is on the concepts. The machines can do the heavy lifting.

The plan is to start with an initial vocabulary and pass the words in the vocabulary through a series of operations that yields a set of relevant features. One vector is generated for every document in the corpus and each entry in a document vector will correspond to one of the selected features. But first we should clarify some terms.

Corpora, documents and words

A corpus is a collection of documents.

In NLP, a document is any collection of words that has an attribute that we are interested in.

A document can be as short as a single word or sentence and as long as an entire book series or website.

In the case of sentiment analysis, we are interested in the polarity of the opinion expressed in the document. Does it express a positive or a negative opinion?

By the way, I will use the term word to refer to sequences of symbols. Under this definition, numbers and punctuation marks count as words.

Initial vocabulary

Here is the corpus that we will work with:

  • I looove this movie.
  • I hate this movie.
  • Best movie I’ve ever seen.
  • What a disappointing movie.

The alphabetically sorted vocabulary of this corpus contains the following words:

{ a, Best, disappointing, ever, hate, I, I’ve, looove, movie, seen, this, What, . }
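
As a minimal sketch, this vocabulary can be assembled with a deliberately simple tokenizer that splits on whitespace and treats a trailing period as a word of its own (a real project would use a proper tokenizer, for example from NLTK or spaCy):

    corpus = [
        "I looove this movie.",
        "I hate this movie.",
        "Best movie I've ever seen.",
        "What a disappointing movie.",
    ]

    def tokenize(document):
        """Split on whitespace; a trailing period becomes its own word."""
        words = []
        for token in document.split():
            if token.endswith("."):
                words.extend([token[:-1], "."])
            else:
                words.append(token)
        return words

    vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})
    print(vocabulary)   # 13 words; Python sorts case-sensitively, so the
                        # order differs slightly from the listing above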

Before we proceed, I suggest you pause for a moment and think about the following question: Would you use every single word in this vocabulary to perform sentiment analysis? Or might there be some words that we can delete or change to improve pattern recognition?

Pipelines

In many NLP projects, the initial vocabulary is passed through a pipeline: a series of operations in which the output of one step is the input to the next step.

The particular sequence of operations described in this section simplifies the data and eliminates irrelevant words.
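
In code, a pipeline can be as simple as a loop that feeds each step's output into the next step. A minimal, generic sketch (with toy steps for illustration):

    def run_pipeline(data, steps):
        """Apply each operation in order; each output becomes the next input."""
        for step in steps:
            data = step(data)
        return data

    # Toy example: double the input, then add one.
    print(run_pipeline(5, [lambda x: 2 * x, lambda x: x + 1]))   # 11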

Warning! Proceed with caution! While many of the steps mentioned below work quite reliably across different tasks and domains, they can filter out relevant words when applied too aggressively.

Function words, such as pronouns, are a case in point. These are words that mainly contribute to the syntax of a sentence. In most applications, we can remove function words without causing a significant loss of information. For some psycholinguistic applications, however, these words can provide useful features.[5] For example, one study found that similarity in the use of function words predicts romantic interest and relationship stability.[6]

With the obligatory warning out of the way, here is an overview of the entire process:

Fig. 9

The first step we apply can be called normalization. Contracted forms (such as I’ve) are expanded (I have), while elongated words (such as looove) are reduced to their conventional form (love). Personally, I have nothing against looove. It’s just that patterns are easier to recognize when all expressions of love have the same number of o’s. The normalized vocabulary looks like this:

{ a, Best, disappointing, ever, hate, have, I, love, movie, seen, this, What, . }

The next step is conversion to lowercase. Does it make a difference for sentiment analysis whether the first letter in the words best and what is uppercase? — No, not really:

{ a, best, disappointing, ever, hate, have, i, love, movie, seen, this, what, . }

Another common step is to remove non-alphabetic words. The full stop, for example, carries little relevant information. Let’s get rid of it:

{ a, best, disappointing, ever, hate, have, i, love, movie, seen, this, what }

Part 2 mentioned that some words, including articles, determiners, prepositions and basic verbs, occur in just about every text: in positive reviews and in negative reviews, in movie reviews and in non-movie reviews, in the previous sentence, in this sentence, in the next sentence … you get the idea. These words are called stop words. Let’s delete them:

{ best, disappointing, ever, hate, love, movie, seen }

And there are some words that behave like stop words within a particular domain, even though they are not considered to be stop words in general. For example, the words movie and seen occur in a large fraction of movie reviews, without providing clues about the polarity of an opinion. I will refer to these words as domain-specific stop words. Deleting these words is the final step in the pipeline and yields the following result:

{ best, disappointing, ever, hate, love }

Overall, we have eliminated 8 out of 13 members of the vocabulary and have arrived at what I think is an intuitively plausible set of words that are relevant for sentiment analysis.
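
The following minimal sketch runs the whole pipeline on our vocabulary. The contraction table, the elongation rule and the two stop-word lists are small hand-made stand-ins for this toy corpus; real projects would draw on curated resources:

    import re

    CONTRACTIONS = {"I've": ["I", "have"]}           # expansion of contracted forms
    STOP_WORDS = {"a", "have", "i", "this", "what"}  # general stop words
    DOMAIN_STOP_WORDS = {"movie", "seen"}            # stop words for movie reviews

    def normalize(words):
        """Expand contractions and reduce elongated words (looove -> love)."""
        result = []
        for word in words:
            for expanded in CONTRACTIONS.get(word, [word]):
                # Three or more repetitions of a letter collapse into one.
                result.append(re.sub(r"(.)\1{2,}", r"\1", expanded))
        return result

    def lowercase(words):
        return [word.lower() for word in words]

    def keep_alphabetic(words):
        return [word for word in words if word.isalpha()]

    def remove_stop_words(words):
        return [word for word in words if word not in STOP_WORDS]

    def remove_domain_stop_words(words):
        return [word for word in words if word not in DOMAIN_STOP_WORDS]

    vocabulary = ["a", "Best", "disappointing", "ever", "hate", "I", "I've",
                  "looove", "movie", "seen", "this", "What", "."]

    for step in [normalize, lowercase, keep_alphabetic,
                 remove_stop_words, remove_domain_stop_words]:
        vocabulary = step(vocabulary)

    print(sorted(set(vocabulary)))
    # ['best', 'disappointing', 'ever', 'hate', 'love']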

Binary document vectors

Using the five remaining words (best, disappointing, ever, hate, love) as features, we can now represent each of the documents as a vector with five entries. For every feature, there is a corresponding vector entry.

Binary vectors are an appropriate choice for short documents. These are vectors whose entries are all 0 or 1.

We assign a feature value of 1 if the feature (word) is present in the document and a value of 0 if the feature is absent.

Consider the third document as an example (Best movie I’ve ever seen.). Two of the five features are present in this document: the first feature (best) and the third one (ever). Consequently, we set the first and the third entry to 1 and the other entries to 0. This gives us the vector [ 1 0 1 0 0 ].

Applying the same procedure to every document in the corpus, we obtain the following vector representations (a code sketch follows the list):

  • I looove this movie. [ 0 0 0 0 1 ]
  • I hate this movie. [ 0 0 0 1 0 ]
  • Best movie I’ve ever seen. [ 1 0 1 0 0 ]
  • What a disappointing movie. [ 0 1 0 0 0 ]
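
Here is a minimal sketch of this vectorization step. It compresses the earlier preprocessing into a single helper (lowercasing, keeping alphabetic tokens, reducing elongated words) so that, for example, Best in a document matches the feature best:

    import re

    FEATURES = ["best", "disappointing", "ever", "hate", "love"]

    def preprocess(document):
        """Lowercase, keep alphabetic tokens, reduce elongated words."""
        tokens = re.findall(r"[a-z]+", document.lower())
        return {re.sub(r"(.)\1{2,}", r"\1", token) for token in tokens}

    def to_binary_vector(document):
        words = preprocess(document)
        return [1 if feature in words else 0 for feature in FEATURES]

    corpus = ["I looove this movie.", "I hate this movie.",
              "Best movie I've ever seen.", "What a disappointing movie."]

    for document in corpus:
        print(to_binary_vector(document), document)
    # [0, 0, 0, 0, 1] I looove this movie.
    # [0, 0, 0, 1, 0] I hate this movie.
    # [1, 0, 1, 0, 0] Best movie I've ever seen.
    # [0, 1, 0, 0, 0] What a disappointing movie.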

Congratulations! If you are reading this, you have learned how to represent images and text documents in a format that can be processed by machine learning algorithms.

Next time, we will discuss functions — and I want you to be as enthusiastic about this subject as I am.

The models used in machine learning to predict unknown attributes all have the form of functions. Neural networks, including the deep neural networks you may be reading about on a regular basis, are essentially sequences of functions that are applied to arrays of numbers.

Thank you for reading! If you’ve enjoyed this article, hit the clap button and follow me to get the next articles in this series.

References

[1] Wang, S. and Manning, C.D., 2012, July. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2 (pp. 90–94). Association for Computational Linguistics.

[2] Li, B., Zhao, Z., Liu, T., Wang, P. and Du, X., 2016. Weighted neural bag-of-n-grams model: New baselines for text classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (pp. 1591–1600).

[3] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. and Potts, C., 2011, June. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 142–150). Association for Computational Linguistics.

[4] Asur, S. and Huberman, B.A., 2010, August. Predicting the future with social media. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology-Volume 01 (pp. 492–499). IEEE Computer Society.

[5] Pennebaker, J.W., Francis, M.E. and Booth, R.J., 2001. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71(2001), p.2001.

[6] Ireland, M.E., Slatcher, R.B., Eastwick, P.W., Scissors, L.E., Finkel, E.J. and Pennebaker, J.W., 2011. Language style matching predicts relationship initiation and stability. Psychological science, 22(1), pp.39–44.
