Disclaimer: The content of this post is purely informative and cannot be considered an investment recommendation. The analyzed data may contain errors
Introduction
In this post, I’m going to do an exploratory data analysis around the users health of two of most famous DeFi Lending Protocols. If you are not familiar with the user’s health and how this indicator is important to avoid liquidations, I wrote this post explaining it with an example of liquidation on Aave.
Getting the data
To get the data I’m going to use one of my favorites protocols, TheGraph. In this previous post I explained how to get data using their subgraphs. For the Compound protocol, all the data comes directly from the subgraph, in case of Aave, it’s needed to use the aavejs library to transform some data.
The fields that will be part of the dataset are the followings
- Health: users health, calculated with the difference of total ETH collateral deposit and total ETH borrowed, a health value below one allows to liquidate the user collateral.
- Protocol: protocol where the user has invested.
- totalBorrowETH: Total amount of ETH borrowed by the user, sum of all assets borrowed in ETH price.
- totalDepositETH: Total amount of ETH deposited by the user, sum of all assets borrowed in ETH price.
- assetsBorrowedCount: Number of assets borrowed by the user.
- assestsDepositedCount: Number of assets deposited by the user.
- stableBorrowedCount: Total of borrows which principal is a stable coin.
- stableDepositedCount: Total deposits which collateral is a stable coin.
- countLiquidated: Number of times that the user has been liquidated.
The users selected should be those who have currently more than one borrow active.
After creating our dataset, we can see how the data is filled by running.
df_users.head()

Cleaning the data
The first part is clean the data and detect outliers, to see how is the health data distributed first we can run
df_users.health.describe()
This will return the following values for the health variable

As we can see, there is a min value of -1, this value is not possible for the health variable, so we are going to delete it, to do so, we can run
df_users.drop(df_users[df_users['health']<0].index, inplace=True)
Running the same command again, we can see that this value is not part of the data anymore

Now we’re going to inspect the size of the dataset, and how many users have borrow money from Aave and how many in compound, we can see it with the following code
aave_borrowers = df_users[df_users.protocol=='AAVE'].shape[0]
compound_borrowers = df_users[df_users.protocol=='COMPOUND'].shape[0]
user_size = df_users.shape[0]
print("Total users in the dataset {}".format(user_size))
print("Number of Aave borrowers {}".format(aave_borrowers))
print("Number of Compound borrowers {}".format(compound_borrowers))
The results are
Total users in the dataset 8615
Number of Aave borrowers 1145
Number of Compound borrowers 7470
Detecting outliers
To see of the health of the users is distributed, we can use the boxplot chart to detect outliers in the data, running
sns.boxplot(x=df_users['health'])

We can see that there are some very risk averse users that can be consider outliers, so we can use the IQR method to remove this outliers, to do so, we can execute
Q1 = df_users.health.quantile(0.25)
Q3 = df_users.health.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
3.902646865965031
And remove them for the dataset
df_users = df_users[~((df_users.health < (Q1–1.5 * IQR)) |(df_users.health > (Q3 + 1.5 * IQR)))]
print("Total users in the dataset after remove outliers {}".format(df_users.shape[0]))
Total users in the dataset after remove outliers 7297
After that, we can see that have been removed from the dataset, looking again at the boxplot we can see that most of the users are between 1 and 3

Histograms
After clean the data and remove outliers, we are interested to see how the user risk is distributed, to see how the users are managing the risk, we can see the histograms of the health in each protocol
df_users['health'].hist(by=df_users['protocol'])

Most of them are close to the liquidation limit, with a few of them with a less risky position, to take a closer look where most of the users are, can ran the same histograms but filtering by where most of them are positioned, by running
df_users[df_users['health']<4].health.hist(by=df_users['protocol'])

We can see that most of them are between 1 and 2, to confirm that, can use the boxplot for this field in each protocol
df_users.boxplot(column='health', by='protocol')

Correlations
The variables of the dataset were not randomly selected, the idea was to try to find some kind of correlation between the users health and the behavior of them in the protocols. One strategy could be, use stable coins to have less volatility risk, other diversify the collateral and principal using several assets, and also could be interesting to see if users with high or less collateral are more risky or if a previous liquidation could make a user more risk averse.
Once that all these variables are in the dataset, we can use the heatmap to plot the correlations between the variables by running
corr = df_users.corr()
sns.heatmap(corr)

There is not a high correlation between these variables and the users health, the higher values are for the countLiquidated and the stableDepositCount, that shows a negative correlation, indicating that when a user has deposited less stable coins or has been liquidated more times, the health tends to be higher.
To examine this variables, we can plot the scatter of this, by running
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df_users['health'], df_users['stableDepositedCount'])
ax.set_xlabel('health')
ax.set_ylabel('stableDepositedCount')
plt.show()

We can see that when the users have deposited more stable coins, the health used to be more concentrated near to 1, and the same happens to the borrow stable variable.

In case of the liquidation count, there is not a clear correlation, there are users with several liquidations and also taking high risk, and users with no liquidations and lower risk. My respects for those with more than 25 liquidations and still close to one on health.

Conclusion
With this simple analysis we can get some insights of the behavior of the users in lending protocols, again we can see how easy is to get data from DeFi and analyze it and not only the actual data, in a Blockchain we have all the history saved in each block.