eXtreme Deep Factorization Machine (xDeepFM)

The new buzz in the recommendation system domain

Abhishek Sharma
Towards Data Science


Photo by Nathan Dumlao on Unsplash

We are living in times where we are spoilt for choice. Whether it is food, music or entertainment, the number of options we have is simply mind-boggling. Thanks to the recommendation engines that fuel these apps and sites, these alternatives reach us as a ranked list. In this blog, we will discuss a recent recommendation algorithm called the eXtreme Deep Factorization Machine (xDeepFM).

This blog is organized as follows:

  1. Current recommendation systems in production
  2. Introducing xDeepFM
  3. Introducing CIN (the core pillar of xDeepFM)
    3.1 CIN characteristics
    3.2 CIN hidden layers
    3.3 Similarity with RNN
    3.4 Similarity with CNN
    3.5 Pooling and concatenation of hidden layers
    3.6 Output equation of xDeepFM
  4. CIN complexity
    4.1 Space Complexity
    4.2 Time Complexity
  5. Coded Example using deepCTR library
  6. References
  7. Summary

Let’s start!!

1. Current recommendation systems

The current recommendation landscape is dominated by FM- and DNN-based models, but good hybrid architectures that fuse the two are also coming up.

1. Factorization Machine (FM) based approach

  • +ve: Learns patterns on combinatorial features automatically
  • +ve: Generalizes well to unseen features
  • -ve: Tries to capture all feature interactions, which results in learning useless interactions and may introduce noise
  • Example: FM itself, the most trusted workhorse of the recommendation domain (a sketch of its second-order term follows this list)
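To make "combinatorial features" concrete, here is a minimal numpy sketch of FM's second-order interaction term, using Rendle's O(nk) reformulation; the feature vector x (length n) and latent matrix V (shape n x k) are illustrative assumptions, not part of any particular library.

import numpy as np

# FM second-order term: sum over pairs i<j of <v_i, v_j> * x_i * x_j,
# computed in O(n*k) as 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
def fm_second_order(x, V):
    linear_mix = V.T @ x                   # shape (k,)
    squared_mix = (V ** 2).T @ (x ** 2)    # shape (k,)
    return 0.5 * np.sum(linear_mix ** 2 - squared_mix)

Because every feature owns a latent vector, a pair's interaction weight is a dot product of two such vectors, which is exactly why FM can generalize to feature pairs never observed together in training.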

2. Deep Neural Network ( DNN ) based approach

  • +ve: Learns sophisticated and selective feature interactions
  • -ve: Feature interactions are modeled at an elemental (bit-wise) level. Categorical variables are one-hot encoded, represented in a dimension D and fed into fully connected layers. This is in stark contrast to FM-based approaches, which model feature interactions at a vector level (user vector * item vector).
  • Example: DNN part of Neural Collaborative Filtering(NCF), DNN for Youtube recommendations

3. DNN + FM (Hybrid) approach

  • +ve: Learns both low/high order feature interactions
  • Examples: Wide & Deep Network (Google), Neural Collaborative Filtering (NCF), Deep and Cross Network (DCN), Deep Factorization Machine (DeepFM) and eXtreme Deep Factorization Machine (xDeepFM)

You can look into this post to get a brief overview of Neural Collaborative Filtering (NCF), one of the most cited hybrid models.

All examples in the hybrid category use a DNN to learn implicit bit-wise feature interactions; they differ in how the explicit high-order feature interactions are learned.

2. Introducing xDeepFM

Figure 1: The architecture of xDeepFM

xDeepFM comprises 3 parts:

  1. The linear model (works directly on top of raw input features)
  2. Plain DNN (works on top of dense feature embeddings)
  3. CIN (works on top of dense feature embeddings)

Out of these 3, CIN is unique to xDeepFM.

3. Introducing Compressed Interaction Network (CIN)

The problem with DNN-based systems is that they learn high-order interactions only implicitly. In xDeepFM, explicit high-order feature interactions are learned through the Compressed Interaction Network (CIN).

3.1 CIN characteristics

CIN is adopted by xDeepFM due to the following benefits:

  • It learns feature interactions at a vector wise level, not at a bitwise level.
  • It measures high order feature interactions explicitly.
  • Its complexity does not grow exponentially with the degree of interactions.

3.2 CIN hidden layers

The explicit feature interactions are learned by the CIN through its hidden layers (every x in Figure 4 is a hidden layer), which are defined as follows:

Equation 1:

$$X^{k}_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij} \left( X^{k-1}_{i,*} \circ X^{0}_{j,*} \right)$$

X(k-1,i): Embedding vector of the ith field in the (k-1)-th hidden layer
X(0,j): Embedding vector of the jth field in the input layer (base embedding/original feature embedding)
W(k,h): Learnable parameter matrix of dimension H(k-1) * m
∘: Hadamard (element-wise) product

Every row in X(k) is differentiated only by W(k,h). In contrast to a DNN, there is no activation function transforming the hidden layer.
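To make Equation 1 concrete, here is a minimal numpy sketch of one CIN hidden layer; the shapes (m fields, embedding size D, H feature maps per layer) follow the definitions above, while the function name and the toy sizes are illustrative only.

import numpy as np

# One CIN hidden layer (Equation 1), assuming:
#   x0:      base embeddings, shape (m, D)
#   xk_prev: previous hidden layer, shape (H_prev, D)
#   W:       learnable feature maps, shape (H_next, H_prev, m)
def cin_layer(x0, xk_prev, W):
    # z[i, j, d] = xk_prev[i, d] * x0[j, d]  (Hadamard products along D)
    z = np.einsum('id,jd->ijd', xk_prev, x0)
    # Compress z with H_next feature maps into the next layer, shape (H_next, D)
    return np.einsum('hij,ijd->hd', W, z)

# Toy usage: m = 4 fields, D = 8, first hidden layer with H1 = 5 rows
m, D, H1 = 4, 8, 5
x0 = np.random.randn(m, D)
W1 = np.random.randn(H1, m, m)   # H_prev == m for the first layer
x1 = cin_layer(x0, x0, W1)       # shape (5, 8)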

3.3 Similarity with RNN

Equation 1 is similar to RNN-based models, as each CIN hidden layer depends on the previous hidden layer and an additional input (here the fixed base embedding X(0)).

3.4 Similarity with CNN

Figure 2: Outer products along each dimension for feature interactions

m: Number of rows in the original feature matrix
H(k): Number of rows in the feature matrix at hidden layer k

Z(k+1) is an intermediate tensor formed by the outer product of hidden layer X(k) and the original feature matrix X(0) along each embedding dimension. Z(k+1) can be regarded as a special type of image and W(k,h) as a filter.

Figure 3: The k-th layer of CIN. It compresses the intermediate tensor from H(k) * m rows to H(k+1) rows

As you slide the weight matrix along dimension D (the size of every feature embedding), you get a hidden vector X(k+1,i).

Figure 4: The expansion of the CIN component of Figure 1

T: Depth of the network

3.5 Pooling and concatenation of hidden layers

The result of the hidden layers is summed row-wise and then concatenated together before being fed to the output layer.

  1. Every hidden layer has a connection with the output units
  2. Sum pooling is applied at every hidden layer to create a pooling vector:

Equation 2:

$$p^{k}_{i} = \sum_{j=1}^{D} X^{k}_{i,j}$$

which gives a pooling vector p(k) of length H(k) at layer k.

All pooling vectors from the hidden layers are concatenated before being connected to the output units:

$$p^{+} = \left[ p^{1}, p^{2}, \ldots, p^{T} \right] \in \mathbb{R}^{\sum_{k=1}^{T} H_k}$$
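Continuing the numpy sketch from Section 3.2, sum pooling and concatenation might look as follows (cin_layer, x0, m and the weight shapes are the assumptions stated there):

# Run all CIN layers, sum-pool each one row-wise, then concatenate
def cin_forward(x0, weights):
    pooled = []
    xk = x0
    for W in weights:                   # one weight tensor per hidden layer
        xk = cin_layer(x0, xk, W)
        pooled.append(xk.sum(axis=1))   # p^k: row-wise sum, length H_k
    return np.concatenate(pooled)       # p^+: length sum of all H_k

# Usage with T = 2 hidden layers (H1 = 5, H2 = 3)
weights = [np.random.randn(5, m, m), np.random.randn(3, 5, m)]
p_plus = cin_forward(x0, weights)       # length 5 + 3 = 8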

3.6 Output equation of xDeepFM

The linear model, CIN and DNN are all trained in parallel, with the output equation as follows:

Equation 3: Output function of xDeepFM

$$\hat{y} = \sigma\left( w_{linear}^{T} a + w_{dnn}^{T} x_{dnn}^{k} + w_{cin}^{T} p^{+} + b \right)$$

σ: Sigmoid function
a: Raw features
x(k,dnn): Output of the plain DNN; p+: pooled output of CIN
w, b: Learnable parameters
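As a sketch of how the three branches combine per Equation 3 (the branch logits here are placeholders computed upstream, not deepctr's actual internals):

# Equation 3: sigmoid over the sum of the three branch logits plus bias
def xdeepfm_output(linear_logit, dnn_logit, cin_logit, b=0.0):
    z = linear_logit + dnn_logit + cin_logit + b
    return 1.0 / (1.0 + np.exp(-z))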

DNN- and CIN-based layers complement each other well, as together they can learn:

1. Both implicit (DNN) and explicit (CIN) feature interactions.

2. Both low (DNN) and high (Both) order feature interactions.

4. CIN Space and Time Complexity

4.1 Space Complexity

Space complexity of CIN:

$$\sum_{k=1}^{T} H_k + \sum_{k=1}^{T} H_k \times H_{k-1} \times m$$

The first term represents the number of parameters at the output layer; the second term represents the number of parameters in the feature maps at every layer k. If we assume all hidden layers have the same number of feature vectors H, the number of parameters can be written as O(m*H*H*T).
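For illustration (hypothetical numbers), with m = 40 fields, H = 200 feature maps per layer and T = 3 hidden layers, the feature maps alone hold 40 × 200 × 200 × 3 = 4.8 million parameters.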

4.2 Time Complexity

Time complexity is O(mHD * H * T), i.e. O(mH²DT).
mHD: Cost of computing one row, X(k+1,h), of the next hidden layer
H: Number of feature vectors (rows) at each hidden layer
T: Number of hidden layers in CIN

With that, we are done with the theory of xDeepFM. It’s time to see it in action.

5. Coded Example

I have used the xDeepFM implementation from the deepCTR library. For some other implementations, check out the References section.

Download the sample data from here and then use the following code to read the data and separate features as dense or sparse.
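In case the embedded gist does not render, here is a minimal sketch of that step; the file name (criteo_sample.txt) and the column layout (dense I1–I13, sparse C1–C26, target label) follow the deepctr Criteo example and are assumptions here.

import pandas as pd

data = pd.read_csv('./criteo_sample.txt')

dense_features = ['I' + str(i) for i in range(1, 14)]    # numeric columns
sparse_features = ['C' + str(i) for i in range(1, 27)]   # categorical columns
target = ['label']

# Fill missing values before encoding/scaling
data[sparse_features] = data[sparse_features].fillna('-1')
data[dense_features] = data[dense_features].fillna(0)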

The sparse features require embedding, while the dense features require normalization. We use MinMaxScaler for normalization and LabelEncoder for encoding the sparse features.
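A sketch of that preprocessing (and of building sparse_feature_columns, which the next line inspects), continuing the snippet above; note that the deepctr import path and embedding_dim=4 vary by library version and are assumptions:

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from deepctr.feature_column import SparseFeat, DenseFeat

# Integer-encode sparse features so they can be fed to embedding layers
for feat in sparse_features:
    data[feat] = LabelEncoder().fit_transform(data[feat])

# Scale dense features to [0, 1]
data[dense_features] = MinMaxScaler().fit_transform(data[dense_features])

# One SparseFeat per categorical column
sparse_feature_columns = [
    SparseFeat(feat, vocabulary_size=data[feat].nunique(), embedding_dim=4)
    for feat in sparse_features
]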

This is how sparse_feature_columns[0] looks.

# creating a dense feature column for every dense feature
dense_feature_columns = [DenseFeat(feat, 1) for feat in dense_features]

This is how dense_feature_columns[0] looks.
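Before the model code, the two lists are combined and the train/test inputs prepared; a sketch, assuming deepctr's get_feature_names helper:

from sklearn.model_selection import train_test_split
from deepctr.feature_column import get_feature_names

feature_columns = sparse_feature_columns + dense_feature_columns
feature_names = get_feature_names(feature_columns)

train, test = train_test_split(data, test_size=0.2, random_state=42)
train_input = {name: train[name] for name in feature_names}
test_input = {name: test[name] for name in feature_names}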

In the next code blob, we initialize, compile, train and predict using xDeepFM:
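A minimal sketch of that blob, assuming deepctr's Keras-style API (both the linear branch and the DNN/CIN branch are given all feature columns, as in the library's examples):

from deepctr.models import xDeepFM

model = xDeepFM(feature_columns, feature_columns, task='binary')
model.compile('adam', 'binary_crossentropy', metrics=['binary_crossentropy'])

history = model.fit(train_input, train[target].values,
                    batch_size=256, epochs=10,
                    validation_split=0.2, verbose=2)

pred = model.predict(test_input, batch_size=256)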

6. References

  1. xDeepFM paper
  2. Some ready-made xDeepFM implementations:
    1. Code from the authors of the xDeepFM paper
    2. Code from Microsoft's library for recommendations
    3. deepctr, which is used in the implementation above

7. Summary

xDeepFM is an example of a hybrid architecture. It combines FM-style vector-wise interaction learning with a DNN to achieve better performance. It does this by using the CIN (Compressed Interaction Network). CIN has 2 special virtues:

  1. It can learn bounded degree feature interactions effectively.
  2. It learns feature interactions at a vector-wise level.

Like other popular hybrid methods, xDeepFM pairs CIN with a DNN, due to which high-order feature interactions can be learned both explicitly and implicitly. This reduces the need for manual feature engineering.

I urge you to try out Microsoft’s library or the xDeepFM authors’ code to play around with different xDeepFM implementations.

You can also look at this post to get a brief overview of NCF which is another popular hybrid architecture.

Please feel free to share your thoughts and ideas in the comments section.
