Textual Novelty Detection

How to use Minimum Covariance Determinant (MCD) to detect novel news headlines

Ilia Teimouri PhD
Towards Data Science


Photo by Ali Shah Lakhani on Unsplash.

In today’s information age, we’re inundated with news articles on a daily basis. Many of these articles are merely restatements of the same facts, but some contain genuinely new information that can have a major impact on our decision-making. For example, someone looking to invest in Meta may want to focus on articles that contain exclusive information, rather than those that simply reiterate previously published data. It’s crucial to be able to distinguish between news that is novel and news that is redundant, so that we can make informed decisions without being overwhelmed by the deluge of information.

This is where novelty detection comes in. Novelty detection refers to the task of identifying new or unknown data that differs from previously seen data. It is an unsupervised learning technique used to detect anomalies, outliers, or new patterns in data. The key idea is to build a model of “normal” data, and then use that model to identify data points that deviate from normal.

In the context of news articles, this involves detecting whether an article contains new information that is not available elsewhere. To do this, we can perhaps develop a baseline of what is known or available, and then compare new information to that baseline. If there are significant differences between the new information and the baseline, then we can say that the information is novel.

Minimum Covariance Determinant (MCD)

The Minimum Covariance Determinant (MCD) method is a technique for robustly estimating the location and covariance matrix of a dataset. The estimate can be used to create an elliptical shape that encapsulates the central mode of a Gaussian distribution, and any data points that lie outside of this shape can be considered novelties (sometimes referred to as anomalies). The MCD method is particularly useful for datasets that are noisy or contain outliers, as it helps to identify unusual data points that do not fit the overall pattern of the data (see example).
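To make the robustness claim concrete, here is a minimal sketch that compares the classical covariance estimate with the MCD estimate. It uses synthetic 2D data rather than real embeddings, and the cluster positions, sizes, and seed are assumptions chosen only for illustration. It relies on scikit-learn's MinCovDet and EmpiricalCovariance classes.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)

# 95 "normal" points from a correlated 2D Gaussian (synthetic, for illustration only)
true_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
inliers = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=95)

# 5 far-away points playing the role of outliers
outliers = rng.normal(loc=[8.0, -8.0], scale=0.5, size=(5, 2))
X = np.vstack([inliers, outliers])

# Classical estimate: every point contributes, so the outliers distort it
empirical = EmpiricalCovariance().fit(X)

# MCD estimate: looks for the subset of points whose covariance
# has the smallest determinant, then refines the estimate from it
mcd = MinCovDet(random_state=0).fit(X)

print("Empirical covariance:\n", empirical.covariance_)
print("MCD covariance:\n", mcd.covariance_)

# support_ marks the points MCD kept when computing the final estimate
print("Points excluded by MCD:", np.where(~mcd.support_)[0])
```

On data like this, the MCD covariance should stay close to the covariance of the main cluster, while the classical estimate is pulled towards the outlying points.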

MCD can be used to detect novelty in news headlines. While the method can be generalized to full articles, our aim is to provide a concise example of applying MCD for novelty detection on short texts. MCD is a robust estimator of multivariate location and scatter, making it well-suited for identifying outliers in multivariate data such as text embeddings. On a dataset of news headlines, MCD will learn a model of “normal” headlines based on covariance. We can then use this model to score new headlines and flag those that significantly deviate from the norm as potentially novel or anomalous stories. The sample code and experiments will illustrate how MCD novelty detection works in practice.
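The idea of learning a model of “normal” headlines and then scoring new ones can be sketched as follows. This is an illustration under stated assumptions, not the workflow used later in this post: the 2D points are synthetic stand-ins for headline embeddings, and the variable names baseline and new_points are hypothetical.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)

# Hypothetical 2D "embeddings" of already-seen headlines (synthetic stand-ins)
baseline = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Fit the MCD-based envelope on the baseline only
envelope = EllipticEnvelope(contamination=0.05, random_state=0)
envelope.fit(baseline)

# Two hypothetical incoming headlines: one near the baseline, one far from it
new_points = np.array([[0.1, -0.2], [6.0, 6.0]])

labels = envelope.predict(new_points)            # 1 = normal, -1 = novel
scores = envelope.decision_function(new_points)  # negative values fall outside the envelope

for point, label, score in zip(new_points, labels, scores):
    print(point, "novel" if label == -1 else "normal", round(score, 2))
```

Fitting on a baseline and predicting on unseen points is the usual novelty-detection setup; the example later in this post instead fits and predicts on the same ten headlines to keep things short.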

Step-by-Step Approach

Embedding: In machine learning we use embedding as a way to represent data in a more compact and efficient form. Embedding transforms raw data into a lower-dimensional representation that captures the most important features of the data.

Text embedding is a specific type of embedding that is used to transform text data into a vector representation. It takes into account the semantics and relationships between words, phrases, and sentences, and converts them into a numerical representation that captures the meaning of the text. This allows us to perform operations such as finding similar text, clustering text based on semantic meaning, and more.
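As a small illustration of “finding similar text”, the sketch below computes cosine similarity between embedding vectors. The three-dimensional vectors are made up for the example (real embeddings have hundreds or thousands of dimensions), and the helper name cosine_similarity is our own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors standing in for real embeddings (made up for illustration)
chatbot_a = [0.9, 0.1, 0.0]
chatbot_b = [0.8, 0.2, 0.1]
eu_fine = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(chatbot_a, chatbot_b)
sim_unrelated = cosine_similarity(chatbot_a, eu_fine)
print(sim_related, sim_unrelated)
```

Vectors pointing in similar directions get a similarity close to 1, which is how embeddings of texts with related meaning are expected to behave.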

Suppose we gather the following headlines about Meta in the past couple of months:

news = [
    "Mark Zuckerberg touts potential of remote work in metaverse as Meta threatens employees for violating return-to-office mandate",
    "Meta Quest 3 Shows Us the Metaverse Dream isn’t Dead Yet",
    "Meta has Apple to thank for giving its annual VR conference added sizzle this year",
    "Meta launches AI chatbots for Instagram, Facebook and WhatsApp",
    "Meta Launches AI Chatbots for Snoop Dogg, MrBeast, Tom Brady, Kendall Jenner, Charli D’Amelio and More",
    "Llama 2: why is Meta releasing open-source AI model and are there any risks?",
    "Meta's Mandatory Return to Office Is 'a Mess'",
    "Meta shares soar on resilient revenue and $40bn in buybacks",
    "Facebook suffers fresh setback after EU ruling on use of personal data",
    "Facebook owner Meta hit with record €1.2bn fine over EU-US data transfers"
]

We can use OpenAI to generate a text embedding for each of the headlines as follows:

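import numpy as np
import openai
import pandas as pd

# Put the headlines in a DataFrame, one row per headline
df = pd.DataFrame({'news': news})

def get_embedding(text, model='text-embedding-ada-002'):
    # Note: this call uses the legacy (pre-1.0) interface of the openai package
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], engine=model)['data'][0]['embedding']

df['embedding'] = df.news.apply(get_embedding)
df['embedding'] = df['embedding'].apply(np.array)

# Stack the embedding vectors into a (headlines x dimensions) matrix
matrix = np.vstack(df['embedding'].values)
matrix.shape

# Output: (10, 1536)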

The text-embedding-ada-002 model from OpenAI takes a piece of text as input and outputs an embedding vector of length 1536. The vector represents the semantic meaning of the input, and can be used for tasks such as semantic similarity, text classification, and more. If you do not have access to OpenAI, you can use other embedding models such as Sentence Transformers.
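For readers taking the Sentence Transformers route, a minimal sketch could look like the following. The helper name embed_with_sentence_transformers and the choice of the all-MiniLM-L6-v2 model are assumptions made for illustration, not part of the original workflow. That model produces 384-dimensional vectors, so the matrix shape will differ from the (10, 1536) shown above.

```python
import numpy as np

def embed_with_sentence_transformers(texts, model_name="all-MiniLM-L6-v2"):
    """Return one embedding row per input text using a local Sentence Transformers model."""
    # Imported here so the rest of the script still runs without the package installed
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model weights on first use
    return np.asarray(model.encode(list(texts)))

# Usage (requires `pip install sentence-transformers`):
# matrix = embed_with_sentence_transformers(news)
# matrix.shape  # (number of headlines, embedding dimension of the chosen model)
```

The rest of the pipeline only needs a 2D NumPy array with one row per headline, so it should work unchanged with embeddings from either source.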

Once we have produced the embeddings, we create a matrix variable that stores a matrix representation of the embeddings from the df['embedding'] column. This is done with the vstack function from the NumPy library, which stacks all of the vectors in the column (each representing a single headline) vertically to create a matrix. This allows us to use matrix operations in the next step.

Compute MCD: We use the embeddings as features and compute the MCD to estimate the location and shape of the central data cloud (central mode of a multivariate Gaussian distribution).

Fit an Elliptic Envelope: We then fit an elliptic envelope to the central mode using the computed MCD. This envelope acts as a boundary to separate normal points from the novel ones.

Predict Novel Sentences: Finally, we use the elliptic envelope to classify the embeddings. Points lying inside the envelope are considered normal, and points lying outside are considered novel or anomalous.

To do all this, we use the EllipticEnvelope class from scikit-learn in Python to apply the MCD:

from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA

# Reduce the dimensionality of the embeddings to 2D using PCA
pca = PCA(n_components=2)
reduced_matrix = pca.fit_transform(matrix)
reduced_matrix.shape

# Fit the Elliptic Envelope (MCD-based robust estimator)
envelope = EllipticEnvelope(contamination=0.2)
envelope.fit(reduced_matrix)

# Predict the labels of the sentences
labels = envelope.predict(reduced_matrix)

# Find the indices of the novel sentences
novel_indices = np.where(labels == -1)[0]
novel_indices

#Output: array([8, 9])

contamination is a parameter that you can tune depending on how many sentences you expect to be novel. It represents the proportion of outliers in the dataset. The predict method returns an array of labels, where 1 denotes inliers (normal points), and -1 denotes outliers (novel points).
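To get a feel for how contamination changes the result, the sketch below fits the envelope several times on the same synthetic 2D points and counts how many are flagged each time. The data, the seed, and the three contamination values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(7)

# 100 synthetic 2D points standing in for PCA-reduced embeddings
points = rng.normal(size=(100, 2))

flagged = {}
for contamination in (0.05, 0.1, 0.2):
    envelope = EllipticEnvelope(contamination=contamination, random_state=0)
    labels = envelope.fit_predict(points)
    # Count how many points end up outside the envelope
    flagged[contamination] = int(np.sum(labels == -1))

print(flagged)
```

When predicting on the same data used for fitting, the number of flagged points is roughly contamination times the number of samples. The parameter therefore behaves like a budget of novel points rather than something the model discovers on its own.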

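We also use PCA to project the high-dimensional embedding vectors onto a 2D space, which we denote by reduced_matrix. This makes the embeddings easy to visualise and saves computation time. It also matters in practice: with only 10 headlines and 1536 dimensions, the covariance matrix of the raw embeddings is singular, so a covariance-based method such as MCD is not reliable in the original space.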
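The effect of the projection can be checked with a short sketch. It uses a random 10 x 50 matrix as a stand-in for the embedding matrix (an assumption made to keep the example fast; the real matrix has 1536 columns) and compares the covariance matrix before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the embedding matrix: 10 samples, 50 dimensions
fake_embeddings = rng.normal(size=(10, 50))

# With n samples, the sample covariance has rank at most n - 1,
# so it cannot be inverted in the full feature space
full_cov = np.cov(fake_embeddings, rowvar=False)
print("Covariance shape:", full_cov.shape)
print("Covariance rank:", np.linalg.matrix_rank(full_cov))

# Projecting onto two principal components gives a 2x2 covariance of full rank
pca = PCA(n_components=2)
reduced = pca.fit_transform(fake_embeddings)
reduced_cov = np.cov(reduced, rowvar=False)
print("Reduced shape:", reduced.shape)
print("Reduced covariance rank:", np.linalg.matrix_rank(reduced_cov))
print("Variance kept by 2 components:", pca.explained_variance_ratio_.sum())
```

The last line is worth inspecting on real data: two components usually keep only part of the variance, so the 2D picture is a simplification of the full embedding space.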

We can see that novel_indices outputs array([8, 9]), which are the sentence indices that are found to be novel.

Plotting the result: We can visualise the result by plotting the embeddings and the elliptic envelope. The inliers (normal points) can be plotted with one color or marker, and the outliers (novel points) with another. The elliptic envelope can be visualised by plotting the ellipse on which the Mahalanobis distance from the center equals the decision threshold.
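For reference, the squared Mahalanobis distance of a point x from a center μ with covariance Σ is (x − μ)ᵀ Σ⁻¹ (x − μ). The sketch below computes it by hand on synthetic 2D data (an assumption for illustration) and compares it with the values returned by scikit-learn's EllipticEnvelope.mahalanobis, which are squared distances.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
points = rng.normal(size=(50, 2))  # synthetic 2D data for illustration

envelope = EllipticEnvelope(contamination=0.1, random_state=0).fit(points)

# Squared Mahalanobis distance by hand: (x - mu)^T  Sigma^{-1}  (x - mu)
centered = points - envelope.location_
precision = np.linalg.inv(envelope.covariance_)
manual_sq_dist = np.einsum("ij,jk,ik->i", centered, precision, centered)

# scikit-learn's mahalanobis() also returns squared distances
sklearn_sq_dist = envelope.mahalanobis(points)
print(np.allclose(manual_sq_dist, sklearn_sq_dist))

# Points whose squared distance exceeds the fitted cut-off are the ones predict() labels -1
cutoff = -envelope.offset_
print(np.array_equal(manual_sq_dist > cutoff, envelope.predict(points) == -1))
```

Because the returned distances are squared, the square root of the threshold is what scales the ellipse axes when drawing the envelope.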

To achieve the visualisation we:

  1. Extract the location and covariance matrix of the fitted elliptic envelope model.
  2. Compute the eigenvalues and eigenvectors of the covariance matrix to determine the orientation and axes lengths of the ellipse.
  3. Compute the Mahalanobis distance of each sample from the center of the fitted ellipse model.
  4. Determine a threshold distance based on the contamination parameter, which specifies the expected percentage of outliers.
  5. Scale the width and height of the ellipse based on the threshold Mahalanobis distance.
  6. Label points inside the ellipse as inliers and outside as outliers.
  7. Plot the inliers and outliers, adding the scaled ellipse patch.
  8. Annotate each data point with its index to identify outliers.

import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# Extract the location and covariance of the central mode
location = envelope.location_
covariance = envelope.covariance_

# Compute the angle, width, and height of the ellipse
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = eigenvalues.argsort()[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
vx, vy = eigenvectors[:, 0]
theta = np.arctan2(vy, vx)

# Compute the width and height of the ellipse based on the eigenvalues (variances)
width, height = 2 * np.sqrt(eigenvalues)

# Compute the squared Mahalanobis distances of the reduced 2D embeddings
# (EllipticEnvelope.mahalanobis returns squared distances)
mahalanobis_distances = envelope.mahalanobis(reduced_matrix)

# Compute the (squared-distance) threshold based on the contamination parameter
threshold = np.percentile(mahalanobis_distances, (1 - envelope.contamination) * 100)

# Scale the width and height of the ellipse by the threshold Mahalanobis distance,
# i.e. the square root of the squared-distance threshold
width, height = width * np.sqrt(threshold), height * np.sqrt(threshold)

# Split the points into inliers and outliers
inliers = reduced_matrix[labels == 1]
outliers = reduced_matrix[labels == -1]

# Plot the inliers and outliers along with the elliptic envelope
plt.scatter(inliers[:, 0], inliers[:, 1], c='b', label='Inliers')
plt.scatter(outliers[:, 0], outliers[:, 1], c='r', label='Outliers', marker='x')
ellipse = Ellipse(location, width, height, angle=np.degrees(theta), edgecolor='k', facecolor='none')
plt.gca().add_patch(ellipse)

# Annotate each point with its index
for i, (x, y) in enumerate(reduced_matrix):
    plt.annotate(str(i), (x, y), textcoords="offset points", xytext=(0, 5), ha='center')

plt.title('Novelty Detection using MCD with Annotations')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

Finally, we get the visualisation for inliers and outliers as:

Plotting the envelope and points labeled as inliers or outliers

Now let us revisit the headlines. Headlines 8 and 9 are:

Facebook suffers fresh setback after EU ruling on use of personal data.

Facebook owner Meta hit with record €1.2bn fine over EU-US data transfers.

Both headlines are related to the European Union’s efforts to regulate how Meta uses and transfers personal data on its platforms.

The inlier headlines, by contrast, are mostly about how Meta is going all-in on AI and virtual reality. The AI focus is evident in the release of new AI chatbots, and the virtual reality focus is evident in the release of the new Meta Quest 3 headset. You can also see that headlines 0 and 6 are both about remote work and the return-to-office mandate, and hence they are closer to each other on the plot.

Summary

In this post we have shown how to distinguish between normal points and novel points based on the data distribution. In short, normal points lie in the high-density region of the data distribution, i.e., they are close to the majority of the other points in the feature space. Novel points lie in the low-density region of the data distribution, i.e., they are far from the majority of the other points in the feature space.

In the context of MCD and the elliptic envelope, normal points lie inside the envelope, which is fitted to the central mode of the data distribution, while novel points lie outside it.

We also learned that two parameters influence the outcome of MCD:

  • Threshold: The decision boundary or threshold is crucial in determining whether a point is normal or novel. For instance, in the Elliptic Envelope method, points inside the envelope are considered normal, and those outside are considered novel.
  • Contamination Parameter: This parameter, often used in novelty detection methods, defines the proportion of the data expected to be novel or contaminated. It affects the tightness of the envelope or threshold, influencing whether a point is classified as normal or novel.

We should also note that in the case of news articles, since each article comes from a different week, the novelty detection method should consider the temporal aspect of the news. If the method does not inherently account for temporal order, you may need to incorporate this aspect manually, for example by considering the change in topics or sentiments over time, which is beyond the scope of this post.
