What moves Bitcoin?

Predicting Bitcoin through news headlines.

Jerónimo Aranda Barois
Towards Data Science


By Pablo Garza and Jerónimo Aranda.

  • Alternative data exploration (NLP)
  • Predictive models
  • Portfolio Simulation (30% yield)
  • Conclusions
  • Jupyter notebook code reference at the end

Introduction

Bitcoin is a decentralized cryptocurrency without an administrator or central bank. It was created in 2008 by an unknown user called “Satoshi Nakamoto”. Bitcoin can be acquired through “mining” or through a direct sale between users, but its price does not depend on an underlying asset or any intrinsic value; it is determined purely by supply and demand.

The process to mine a Bitcoin requires two steps:

  1. Verify that the information in 1 MB worth of transactions is valid; this can be one transaction or thousands of transactions, depending on the size of each one. This part is relatively fast, and once step two is completed, this 1 MB block is added to the public record of transactions.
  2. Once the transaction information has been validated, the user who is mining becomes eligible to earn the Bitcoin reward, but to keep the rate at which blocks are added to the “Blockchain” constant over time, a process called “Proof-of-Work” is carried out. In this process, users who are trying to mine must find a “hash”, a 64-digit hexadecimal number that meets the requirements established by the protocol. Finding this number has a probability of about 1 in 13 trillion, so it requires a lot of computational capacity and resources. Once the “target hash” is found, the new block is accepted and the user who found it receives a Bitcoin reward.

In this way, the users who mine Bitcoin keep transactions valid, and fraud becomes unfeasible because of the amount of resources and computational power needed to find the target hash faster than everyone else.

The reward obtained for mining is cut in half every time 210,000 blocks are added, which takes about 4 years. The goal is for Bitcoin to gain value from its scarcity: at some point a finite number of Bitcoins will exist with no possibility of mining more. To keep the incentive for miners and keep the transaction process secure, miners are also paid transaction fees, which become more important as the mining reward decreases.
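As a quick illustration of this schedule (a sketch built from the 210,000-block halving interval, not code from the original notebook):

def block_subsidy(height, initial_reward=50.0, halving_interval=210_000):
    # the reward paid to miners halves every 210,000 blocks
    return initial_reward / (2 ** (height // halving_interval))

print(block_subsidy(0))        # 50.0 BTC at launch
print(block_subsidy(210_000))  # 25.0 BTC after the first halving
print(block_subsidy(630_000))  # 6.25 BTC after the third halving (mid-2020)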

Figure 0. Bitcoin price through time.

On the other hand, the Bitcoin price has been very volatile over time. Less than three years ago the price of Bitcoin was around 1,000 USD, while the price as of December 1, 2019 is around 7,400 USD, an appreciation of 640%. Not only that: the price reached levels of almost 19,000 USD before falling 25% in less than 2 days. Bitcoin has therefore been subject to large price swings, a volatility that makes investing in this cryptocurrency very risky and unattractive for investors seeking more moderate returns with controlled risk.

This is how we come to our working hypothesis:

Is it possible to predict the movement of Bitcoin through news analysis?

The objective of this article is to implement different statistical models to see whether there is a relationship between the words that appear in the news and Bitcoin’s movement. Additionally, we want to see whether this relationship can help us predict the future movement of the cryptocurrency and monetize those predictions through portfolios. Taking advantage of the market’s liquidity, we will observe the evolution of a capital of USD $100,000 over a dynamic position in Bitcoin.

Exploratory data analysis

It is important to mention that Bitcoin is not traded on a regulated exchange; it is exchanged directly between users, “peer-to-peer”, most of them wallet users. For this reason Bitcoin can be traded at any time of the day, so in order to measure a daily change we adopted the opening (open) and closing price (price) convention of traditional markets.

Remember that our target variable is the change of open vs. price, a categorical variable. Alongside it there are other variables that describe the Bitcoin market for each day, such as high and low. However, these variables do not really correlate with our target variable. Time series models could probably make better use of them, but that is not the focus of this project.

This is where the concepts of natural language processing (NLP) come in. The information that our model will use to predict our open vs. price target variable comes from newspaper headlines, as mentioned in the introduction. It is worth presenting two new concepts here: text vectorizers and n-grams.

A text vectorizer is, as the name suggests, a way to vectorize text, that is, to assign to each word or piece of text a vector in some space. There are many ways to do this; some techniques assign a one-hot vector to each word in your vocabulary, a fancy name for the set of all words in your corpus. A one-hot vector is a vector of zeros with a single one in the position corresponding to the index of the word in your vocabulary.
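For example, here is a minimal sketch of a one-hot vector (the tiny vocabulary below is just an illustration, not the one used in this project):

def one_hot(word, vocabulary):
    # vector of zeros with a single 1 at the index of the word in the vocabulary
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

vocabulary = ['bitcoin', 'is', 'a', 'very', 'volatile', 'market']
print(one_hot('volatile', vocabulary))  # [0, 0, 0, 0, 1, 0]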

Other ways of vectorizing text assign each word a vector that encodes context in particular geometries of a multidimensional space, such as the Word2Vec space. In our particular case, we used a vectorizer that counts the occurrences of each word and each 2-word in the news from the 12 hours prior to each opening of the Bitcoin market.

To better understand 1-words and 2-words, let’s check the concept of n-grams with an example. Consider the following sentence:

“Bitcoin is a very volatile but very effective market, long live Bitcoin”

In the previous sentence all 1-grams would make up the list:

  • Bitcoin
  • is
  • a
  • very
  • effective
  • etc.

One entry per word. The list of 2-grams, or as we called them, 2-words, would be:

  • Bitcoin is
  • is a
  • a very
  • very volatile
  • etc.

These 2-words help extract information from pairs of words that often go together, such as “New York”. Now, suppose the sentence above were the only headline in a 12-hour window. To this observation, along with the open, the price and our target variable, a vector would be added counting the occurrences of each 1-word and 2-word. That is, the cell for the market column would be 1, because “market” was mentioned only once, while the bitcoin and very columns would be 2, since they were mentioned twice, and so on for every n-gram count in the vocabulary.
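As a quick illustration (a sketch using scikit-learn’s CountVectorizer as a stand-in for the vectorizer actually used in this project), counting the 1-grams and 2-grams of the example sentence would look like this:

from sklearn.feature_extraction.text import CountVectorizer

headline = ['Bitcoin is a very volatile but very effective market, long live Bitcoin']
# ngram_range=(1, 2) counts both single words and pairs of consecutive words;
# note that the default tokenizer drops one-character tokens such as 'a'
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(headline)

for ngram, count in zip(vectorizer.get_feature_names_out(), counts.toarray()[0]):
    print(ngram, count)   # 'bitcoin' and 'very' appear twice, the rest once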

To better understand our data, let’s see the following graph with 2 axes:

Figure 1. Interesting words, open, price and our target variable.

As you can see in Figure 1, on each date for which there is an observation of the Bitcoin market values there is also the number of occurrences of words like elon and others, that is, how many times the word came up in the headlines of the 12 hours before the observation of each open value. It is difficult to extract information from this graph alone, especially since our dataset has occurrence counts for 2,960 words and 2-words!

It is important to mention that the preprocessing needed to extract such a clean dataset is not simple. Luckily, there are tools like OpenBlender that make this work much easier. Obtaining a set of headlines from several news sources, blended with numerical values from other markets and with 12-hour windows or other configurations, can be achieved with a single call to its API. The API is aimed at R and Python users so they can train NLP models or other machine learning models in a very fluid way.

Data and target variables go directly into your DataFrame, in this case a blend of BTC-to-USD prices and news.

Let’s see the API calls:

import OpenBlender
import pandas as pd
import json
action = 'API_createTextVectorizer'
vectorizer_parameters = {
    'token' : 'your_token',
    'id_user' : 'your_id_user',
    'name' : 'News Headlines',
    'anchor' : {'id_dataset' : '5d571f9e9516293a12ad4f6d', 'include_features' : ['title']},
    'ngram_range' : {'min' : 1, 'max' : 2},
    'language' : 'en',
    'remove_stop_words' : 'on',
    'min_count_limit' : 2
}
res = OpenBlender.call(action, vectorizer_parameters)

This call creates the text vectorizer from the news headlines that will later be blended with the Bitcoin market data. Let’s see its most important parameters:

  • anchor: The id of the news dataset and the name of the features to include as source (in this case only ‘title’)
  • ngram_range: The min and max length of the set of words which will be tokenized
  • remove_stop_words: So it eliminates stop-words from the source

Next, an API call brings the headline information blended with the Bitcoin market data directly into a DataFrame via the pullObservationsToDF() function:

parameters = {
    'token' : 'your_token',
    'id_user' : 'your_user_id',
    'id_dataset' : '5d4c3af79516290b01c83f51',
    'target_threshold' : {'feature' : 'change', 'success_thr_over' : 0},
    'lag_target_feature' : {'feature' : 'change_over_0', 'periods' : 1},
    'blends' : [{'id_blend' : '5de020789516293a833f5818',
                 'blend_type' : 'text_ts',
                 'restriction' : 'predictive',
                 'blend_class' : 'closest_observation',
                 'specifications' : {'time_interval_size' : 3600*12}}],
    'date_filter' : {'start_date' : '2019-08-20T16:59:35.825Z',
                     'end_date' : '2019-11-04T17:59:35.825Z'},
    'drop_non_numeric' : 1
}

def pullObservationsToDF(parameters):
    action = 'API_getObservationsFromDataset'
    df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']),
                      convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
    df.reset_index(drop=True, inplace=True)
    return df

df = pullObservationsToDF(parameters)

This call prepares the target variable through the target_threshold dictionary key, and lag_target_feature lags it one day. Let’s look at other interesting parameters:

  • id_blend: The id from the TextVectorizer, that was an output in the first API call.
  • blend_type: ‘text_ts’ so it knows it’s a text and timestamp blend.
  • specifications: the maximum look-back time (in seconds) from which observations will be gathered, in this case 12 hours (3600*12). This simply means that every Bitcoin price observation is predicted with news from the previous 12 hours.

OpenBlender’s flexibility allows you to do this kind of blending with plenty of newspapers and stock markets, as well as other spatial or time series datasets.

To continue exploring our data, let’s see what our headlines are talking about by plotting the words with the most occurrences across our 68 observations:

Figure 2. Most mentioned words and 2-words throughout our sample.

Our dataset is clearly biased toward trump, china and other currently popular topics; however, this is not very informative about how they affect the Bitcoin market. To get better insight, let’s look at the n-grams most and least correlated with our target variable.

Figure 3. N-grams most and least correlated with the positive change between the opening and closing of bitcoin.
  • Can the impeachment affect the price of the dollar and thus encourage the purchase of bitcoin?
  • Elon Musk has large investments in Bitcoin, does the success or failure of his projects move bitcoin?

These and other narratives could be developed with a deeper NLP analysis, which is out of the scope of this article. We will continue improving our understanding of the data with a principal component analysis (PCA), which reduces the dimensionality of our observations from more than 2,900 variables down to 2, projecting them onto what could be the best “photo” of our observations, or put another way, the 2 orthogonal axes that maximize the total variance of the data:

Figure 4. Our database in its first 2 main components.

In this projection it seems hard to find a cut that would segment our observations well enough to produce an adequate forecast of our target variable. However, these 2 principal components do not explain even 10% of the total variance of our observations, mainly because of the high dimensionality of each observation. Luckily, the correlations in Figure 3 keep us hopeful that we will be able to extract information from the headlines to build a good prediction.
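For reference, a projection like the one in Figure 4 can be sketched with scikit-learn (an illustration only; the price and target column names dropped below are assumptions, and X stands for the matrix of n-gram counts, one row per 12-hour window):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# keep only the n-gram count columns; the column names dropped here are assumptions
X = df.drop(columns=['open', 'price', 'change', 'change_over_0', 'timestamp'], errors='ignore')

X_scaled = StandardScaler().fit_transform(X)      # standardize the n-gram counts
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())        # around 0.10 for this data, as noted above

plt.scatter(components[:, 0], components[:, 1])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()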

Predictive Models

Remember that the variable we are interested in predicting is the positive change (open vs. price), that is, 2 classes (a short sketch of how this target can be built follows the list):

  • 1 if the difference between price and open was positive.
  • 0 if it was negative.
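In our case this label is already produced by the OpenBlender call through target_threshold, but as a hedged sketch, the same binary target could be derived locally like this (the 'open' and 'price' column names are assumptions):

# sort oldest-first so the chronological train/test split below makes sense
df = df.sort_values('timestamp').reset_index(drop=True)
# 1 if the close (price) is above the open, 0 otherwise
df['positive_change'] = (df['price'] - df['open'] > 0).astype(int)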

This binary classification problem was attacked with the following models:

  • LDA
  • QDA
  • Logistic regression
  • Random Forest

These models were trained on the first 70% of the data, in date order, and tested on the remaining 30%.
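As a rough sketch of this setup, assuming scikit-learn and reusing the target built in the previous sketch (the column names dropped below are assumptions, and this is an illustration rather than the exact notebook code):

from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X: the n-gram count columns, y: the binary target from the earlier sketch
y = df['positive_change']
X = df.drop(columns=['positive_change', 'open', 'price', 'change', 'change_over_0', 'timestamp'],
            errors='ignore')

split = int(len(df) * 0.7)                    # first 70% (oldest dates) for training
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

models = {
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))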

Among the models mentioned above, LDA and logistic regression are linear models, while QDA and random forest are not. LDA and QDA stand for linear and quadratic discriminant analysis, respectively. These two models look for a linear or quadratic function that maximizes the separation between the groups defined by the target variable while minimizing the variance within each group.

The problem with some of these models is the difficulty of interpreting the results they generate, especially in the case of logistic regression and random forest. Random forest, for example, builds many randomized decision trees and aggregates them in a way that maximizes predictive performance.

All these models were evaluated and, surprisingly, LDA, the simplest of them all, turned out to be the best at predicting our target variable, with 63% accuracy. It is worth mentioning that the rest of the models achieved lower accuracy, but all of them did better than random, which is enough evidence to support our working hypothesis. Let’s now look at the strongest correlations between the n-grams and our observations projected onto the linear discriminant function:

Figure 5. Correlations between n-grams and the projected observations in the discriminant function.

These correlations can be interpreted as follows: an n-gram with a positive correlation increases the value of the discriminant function, and one with a negative correlation decreases it.

The discriminant function aims to separate the labels given by the target variable with a cutting point: if the 12 hours of news fall above the cutting point of the discriminant function, we predict that the change between price and open will be positive, while if the value falls below the cutting point, we predict that the change will be negative.

To better understand this prediction let’s look at the following territorial map where we can visualize this behaviour:

Figure 6. Territorial map generated from the fitted LDA model.

Each observation, that is, the n-gram occurrences from 12 hours of headlines, is projected onto this discriminant function, shown on the vertical axis. The cutting point separating the territories is what allows us to generate our prediction: the bottom territory predicts that the change will be negative, while the top one predicts the opposite. 100% accuracy would mean that all the blue and black observations fall in the lower territory and the orange and red ones in the upper one. In spite of the errors in our prediction, it is still much better than what a coin toss would give; see the confusion matrix output by our model:

Figure 7. Confusion matrix generated from the predictions obtained by the LDA model.

This graph shows our model’s correct and incorrect predictions: on 4 occasions the model predicted a rise that did not happen, and on 3 occasions we failed to predict a decline. 8 times it was predicted that the price would go down and it did go down, while 4 times it was predicted that the price would go up and it went up. These 12 successes, which coincide with the trace of the matrix, give the 63% accuracy.
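For reference, this kind of matrix can be computed directly from the fitted LDA of the earlier sketch (scikit-learn assumed, reusing X_test, y_test and the models dictionary defined above):

from sklearn.metrics import confusion_matrix, accuracy_score

lda = models['LDA']                       # the fitted LDA from the earlier sketch
preds = lda.predict(X_test)
print(confusion_matrix(y_test, preds))    # rows: actual class, columns: predicted class
print(accuracy_score(y_test, preds))      # around 0.63 in the article's experiment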

Portfolio simulation

As a first step, we predict whether the price of Bitcoin will increase or decrease the next day; additionally, we track the evolution of a portfolio with a starting capital of USD $100,000. To determine this evolution we rely on a dynamic portfolio, that is, a portfolio whose position changes daily as the prediction of Bitcoin’s movement is updated before the market open.

Our dynamic portfolio works as follows. The position, long or short, is defined by the model’s prediction: the position is +1 if the model predicts a positive change for the day, and -1 otherwise.

Where a position of 1 represents a long position in Bitcoin, that is, we buy Bitcoin at the open in order to sell it at the closing price. If our prediction is correct and the price of Bitcoin increases, this represents a gain for the investor, since we sell above the purchase price. On the other hand, a position of -1 represents a short position in Bitcoin: we borrow the Bitcoin and sell it immediately, so we owe Bitcoin to another investor. At the end of the day, at the close of traditional markets, we must buy the Bitcoin back to return it to the original owner; therefore, if our prediction is correct, this also represents a gain, since we sold at a price above the buyback price.

Note also that for the evolution of capital what matters is whether the prediction is correct; there is no difference between type I and type II errors, that is, false positives and false negatives, since in both cases we lose money and in the context of this problem neither is worse than the other.

Now, the evolution of capital changes as follows: the capital for each day equals the previous day’s capital multiplied by (1 + position × yield), where the yield is Bitcoin’s open-to-close return for that day.

If the yield is negative and our prediction is correct, then we take a short position in the asset, which produces a capital gain. Note that the only way we lose capital is when our prediction is incorrect, since then the yield and the position have opposite signs and the total return is negative. Here we see that the yield is a primary part of the analysis: if we correctly predict the days of low price variability (days of low yield) and incorrectly predict the days of high variability (days of high yield in absolute value), we can incur large losses even if the model’s accuracy is greater than 50%.
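A minimal sketch of this capital evolution, under the frictionless assumptions listed below and reusing df, split and the LDA predictions preds from the earlier sketches (the 'open' and 'price' column names are assumptions):

import numpy as np

capital = 100_000.0
history = [capital]

test = df.iloc[split:]                                        # the 30% hold-out, in date order
daily_yield = (test['price'] - test['open']) / test['open']   # open-to-close return of each day
position = np.where(preds == 1, 1, -1)                        # +1 long, -1 short

for r, p in zip(daily_yield, position):
    capital *= 1 + p * r   # capital grows when the sign of the position matches the sign of the return
    history.append(capital)

print(history[-1])         # final capital after the test window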

Additionally, it is important to state the assumptions we make in order to simulate the portfolios. Recall that the model was trained with 70% of the data, and this same model predicts each observation in the remaining 30%. The assumptions are as follows:

  1. Perfect markets: the price reflected in the markets contains all public information, that is, no investor has privileged information that could generate irregular profits. Additionally, the market is liquid, with buyers and sellers always available. This assumption matters because the simulation considers that there is always someone willing to sell and to buy at the market open, when we change our position each day.
  2. Short positions can be taken without limit: in reality, short positions are limited by two factors, the first being the availability of someone who holds Bitcoin and is willing to lend it to the investor; the second, leverage, is covered in assumption 4.
  3. Zero cost for transactions: In real life, as explained above, when exchanging Bitcoin the investor would have to pay a commission on the transaction.
  4. Leverage of 1 for both positions: to take short positions an investor requires some level of leverage, defined as the amount of USD borrowed in Bitcoin per unit of capital the investor holds. For this model we consider a leverage level of 1 and assume it is not a limitation for short positions, complementing assumption 2.

Let’s visualize the performance of our portfolio in the next graph:

Figure 8. Evolution of our portfolio together with the bitcoin market and our predictions.

We see that capital grows over time, although the incorrect predictions are visible as points where capital is lost. After the 18 predicted observations, which represent almost a calendar month, the portfolio had a yield of around 30%. We saw earlier that the best model in terms of prediction was LDA, with an accuracy of 63%. The model correctly predicts more than 50% of the test observations; however, this alone does not guarantee positive returns. Remember that the evolution of capital depends on both the correctness of the prediction and the daily yield: if the errors fall on the days when the yield is largest in absolute value, the capital could shrink, that is, produce negative monthly and annual returns.

Next steps

We believe that this attempt to measure the relationship between the movements of this highly volatile cryptocurrency and public information in the media can be improved. Therefore, we suggest the following next steps:

  1. Include more information by adding other media outlets such as Bloomberg, etc. The goal is to include outlets that are not necessarily business-focused and obtain prediction models with more and better information to fit.
  2. Fine-tune the model through the use of time series. Time series are of great help for predicting the returns of different financial assets; ARMA models, for example, could increase the level of accuracy.
  3. Dynamic algorithm that is retrained every day with more information and can make investment decisions in real time. It is important that the model is retrained as new information becomes available, so the prediction model could improve its accuracy daily. Additionally, real-time decisions can help the model become better in terms of portfolio yields, because if at some point a story is published that contains words correlated with the price of Bitcoin a position could be taken immediately.
  4. Dynamic portfolio with risk analysis. Finally, it would be important to take into account the uncertainty of the predictions in order to reduce the risk of the dynamic portfolio. For this, we need to keep working on a probability function that tells us how confident each prediction is; implementing dynamic allocation of capital would protect the investor from adverse returns of great magnitude.

Conclusions

Let’s recap what we found in our research. First, we conclude that our working hypothesis is confirmed, since there is a relationship between the words that appear in the headlines and the Bitcoin movement. This relationship exists and allows predictions with limited certainty. We also note that some words correlate more strongly with Bitcoin’s movement and that in some cases there is a financial interpretation of why their correlation has a specific sign.

The best model was linear discriminant analysis (LDA), with an accuracy of 63%. Additionally, we tried other approaches; the first of these used principal component analysis (PCA), but the team decided its results were not significant, since the first two principal components explained barely 10.5% of the variability of the data.

Finally, we saw that using these predictions we can track the evolution of a portfolio with a capital of USD $100,000, and that it achieved a yield of 30% in the calendar month of prediction. This speaks to the possible applications of predictive models in finance and shows that the relationships between these variables can be monetized.

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

Thanks to Federico Riveroll as this study was based on ideas from his tutorial, find the Jupyter notebook with the code here.
