Predict Any Cryptocurrency by Applying NLP to Global News

A step-by-step tutorial using Python.

Federico Riveroll
Towards Data Science


Image from shutterstock.com

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

In today’s harsh global economic conditions, traditional indicators and techniques can perform poorly (to say the least).

In this tutorial, we’ll search news for useful information, transform it into a numerical format using NLP, and train a machine learning model to predict the rise or fall of any given cryptocurrency, all in Python.

Prerequisites

  • Have Python 3.6+ installed
  • Install pandas, sklearn and openblender (with pip)
$ pip install pandas OpenBlender scikit-learn

Step 1. Get the Data

We can use any cryptocurrency. For this example, let’s use this Bitcoin dataset.

Let’s pull the daily price candles from the beginning of 2020.

import pandas as pd
import numpy as np
import OpenBlender
import json

token = 'YOUR_TOKEN_HERE'
action = 'API_getObservationsFromDataset'

# ANCHOR: 'Bitcoin vs USD'
parameters = {
    'token' : token,
    'id_dataset' : '5d4c3af79516290b01c83f51',
    'date_filter' : {"start_date" : "2020-01-01",
                     "end_date" : "2020-08-29"}
}

df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']),
                  convert_dates=False,
                  convert_axes=False).sort_values('timestamp', ascending=False)
df.reset_index(drop=True, inplace=True)
df['date'] = [OpenBlender.unixToDate(ts, timezone='GMT') for ts in df.timestamp]
df = df.drop('timestamp', axis=1)

Note: You need to create an account on openblender.io (free) and add your token (you’ll find it in the ‘Account’ section).

Let’s take a look.

print(df.shape)
df.head()

We have 254 daily observations of Bitcoin prices since the beginning of the year.

*Note: The same pipeline we apply to these 24-hour candles can be applied to candles of any size (even per second).

Step 2. Define and Understand our Target

In our Bitcoin data, we have a column ‘price’ with the closing price of the day and ‘open’ with the opening price of the day.

We want the percentage difference of the closing price with respect to the opening price, so we have a variable that captures that day’s performance.

To get this variable we will calculate the logarithmic difference between the close and open price.

df['log_diff'] = np.log(df['price']) - np.log(df['open'])
df

The ‘log_diff’ can be thought of as an approximate percentage change; for the purposes of this tutorial, they are practically equivalent.
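As a quick sanity check with made-up prices, the log difference and the simple percentage change are nearly identical for small daily moves:

open_price, close_price = 9200.0, 9246.0  # hypothetical open/close prices
log_diff = np.log(close_price) - np.log(open_price)
pct_change = (close_price - open_price) / open_price
print(round(log_diff, 5), round(pct_change, 5))  # ~0.00499 vs 0.005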

(We can see the very high correlation with ‘change’)

Let’s take a look.

We can see a bearish behaviour through the year, with the variation mostly staying between -0.1 and 0.1 (minding one violent outlier).

Now, let’s generate our target variable by setting “1” if the performance was positive (log_diff > 0) and “0” otherwise.

df['target'] = [1 if log_diff > 0 else 0 for log_diff in df['log_diff']]
df

Simply put, our target is to predict whether the performance will be positive or not the following day (so we can make a potential trading decision).

Step 3. Get News Data and Blend It with Our Own

Now, we want to time-blend external data with our Bitcoin data. Simply put, this means to outer join another dataset using the timestamp as key.
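Conceptually (setting OpenBlender aside for a moment), a time blend is similar to joining two tables on a shared timestamp key; here is a toy pandas sketch of the idea, with made-up values:

# Toy illustration only (not OpenBlender's implementation)
prices = pd.DataFrame({'timestamp': [1, 2, 3], 'price': [100.0, 101.0, 99.0]})
news = pd.DataFrame({'timestamp': [2, 3, 4], 'news_count': [5, 2, 7]})
print(pd.merge(prices, news, on='timestamp', how='outer'))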

We can do this very easily with the OpenBlender API, but first we need to create a Unix Timestamp variable.

The Unix timestamp is the number of seconds elapsed since 1970 in UTC. It is a very convenient format because it is the same in every time zone in the world!
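For reference, this is how the same conversion looks with Python’s standard library (the example date is hypothetical, in the ‘DD-MM-YYYY HH:MM:SS’ format used below):

from datetime import datetime, timezone as tz

sample_date = '01-01-2020 00:00:00'  # hypothetical date string in GMT/UTC
dt = datetime.strptime(sample_date, '%d-%m-%Y %H:%M:%S').replace(tzinfo=tz.utc)
print(int(dt.timestamp()))  # 1577836800 -> seconds since 1970-01-01 UTC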

format = '%d-%m-%Y %H:%M:%S'
timezone = 'GMT'

# Recreate the Unix timestamp from the 'date' column
df['timestamp'] = OpenBlender.dateToUnix(df['date'],
                                         date_format = format,
                                         timezone = timezone)
df = df[['date', 'timestamp', 'price', 'target']]
df.head()

Now, let’s search for useful datasets that are time-intersected with ours.

search_keyword = 'bitcoin'
df = df.sort_values('timestamp').reset_index(drop = True)
print('From : ' + OpenBlender.unixToDate(min(df.timestamp)))
print('Until: ' + OpenBlender.unixToDate(max(df.timestamp)))
OpenBlender.searchTimeBlends(token,
                             df.timestamp,
                             search_keyword)

This retrieves a list of datasets with their names, descriptions, URLs to the interface and even their features, and, most importantly, the percentage of time overlap (intersection) with our dataset.

After browsing through some, this one about Bitcoin News and latest threads looks interesting.

Screenshot of Bitcoin News dataset

Now, from the above dataset, we are only interested in the ‘TEXT’ feature which contains the news. So let’s blend the news for the past 24 hours.

# We need to add the 'id_dataset' and the 'feature' name we want.
blend_source = {
    'id_dataset' : '5ea2039095162936337156c9',
    'feature' : 'text'
}

# Now, let's 'timeBlend' it to our dataset
df_blend = OpenBlender.timeBlend(token = token,
                                 anchor_ts = df.timestamp,
                                 blend_source = blend_source,
                                 blend_type = 'agg_in_intervals',
                                 interval_size = 60 * 60 * 24,
                                 direction = 'time_prior',
                                 interval_output = 'list',
                                 missing_values = 'raw')
df = pd.concat([df, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)
df.head()

The parameters for the Time Blend:

  • anchor_ts: We only need to send our timestamp column so that it can be used as an anchor to blend the external data.
  • blend_source: The information about the feature we want.
  • blend_type: ‘agg_in_intervals’ because we want 24 hour interval aggregation to each of our observations.
  • interval_size: The size of the interval in seconds (24 hours in this case).
  • direction: ‘time_prior’ because we want the interval to gather observations from the prior 24 hours and not forward.

And now we have the same dataset, but with 2 added columns: one with a list of the texts gathered within each 24-hour interval (‘last1days’) and another with their count.
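To see exactly what was added, we can peek at the new columns (this assumes OpenBlender’s ‘last1days’ naming convention, which also appears further below):

# Inspect the columns added by the blend (column names assumed from the convention above)
new_cols = [col for col in df.columns if 'last1days' in col]
print(new_cols)
df[new_cols].head()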

Now let’s be a bit more specific, and let’s try to gather ‘positive’ and ‘negative’ news with a filter adding some ngrams (off the top of my head).

# We add the ngrams to match on a 'positive' feature.
positive_filter = {'name' : 'positive',
                   'match_ngrams' : ['positive', 'buy', 'bull', 'boost']}
blend_source = {
    'id_dataset' : '5ea2039095162936337156c9',
    'feature' : 'text',
    'filter_text' : positive_filter
}
df_blend = OpenBlender.timeBlend(token = token,
                                 anchor_ts = df.timestamp,
                                 blend_source = blend_source,
                                 blend_type = 'agg_in_intervals',
                                 interval_size = 60 * 60 * 24,
                                 direction = 'time_prior',
                                 interval_output = 'list',
                                 missing_values = 'raw')
df = pd.concat([df, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

# And now the negatives
negative_filter = {'name' : 'negative',
                   'match_ngrams' : ['negative', 'loss', 'drop', 'plummet', 'sell', 'fundraising']}
blend_source = {
    'id_dataset' : '5ea2039095162936337156c9',
    'feature' : 'text',
    'filter_text' : negative_filter
}
df_blend = OpenBlender.timeBlend(token = token,
                                 anchor_ts = df.timestamp,
                                 blend_source = blend_source,
                                 blend_type = 'agg_in_intervals',  # closest_observation
                                 interval_size = 60 * 60 * 24,
                                 direction = 'time_prior',
                                 interval_output = 'list',
                                 missing_values = 'raw')
df = pd.concat([df, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

Now we have 4 new columns: the count and the list of ‘positive’ and ‘negative’ news.

Let’s take a look at the correlation between the target and the other numerical features.

features = ['target', 'BITCOIN_NE.text_COUNT_last1days:positive', 'BITCOIN_NE.text_COUNT_last1days:negative']
df[features].corr()['target']

We can notice slight negative and positive correlations, respectively, with our newly generated features.

Now let’s use a TextVectorizer to get a large number of auto-generated token features.

I created this TextVectorizer on OpenBlender for the ‘text’ feature of the BTC News dataset with over 1,200 ngrams.

Let’s blend those features with ours. We can use the exact same code; we only need to pass the ‘id_textVectorizer’ in the blend_source.

# BTC Text Vectorizer
blend_source = {
    'id_textVectorizer' : '5f739fe7951629649472e167'
}
df_blend = OpenBlender.timeBlend(token = token,
                                 anchor_ts = df.timestamp,
                                 blend_source = blend_source,
                                 blend_type = 'agg_in_intervals',
                                 interval_size = 60 * 60 * 24,
                                 direction = 'time_prior',
                                 interval_output = 'list',
                                 missing_values = 'raw').add_prefix('VEC.')
df = pd.concat([df, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)
df.head()

Now we have a 1229 column dataset with binary features of ngram occurrences on the aggregated news of each interval, aligned with our target.
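For intuition, the binary ngram-occurrence features this produces are conceptually similar to what scikit-learn’s CountVectorizer with binary=True would generate locally (the headlines below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Toy headlines, only to illustrate binary ngram-occurrence features
headlines = ["Bitcoin price rallies as investors buy the dip",
             "Analysts warn of a sharp drop in crypto markets"]
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2), max_features=20)
X_tokens = vectorizer.fit_transform(headlines)
print(vectorizer.get_feature_names_out())
print(X_tokens.toarray())  # one binary row per headline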

Step 4. Apply ML and View Results

Now, let’s apply some simple ML to view some results.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix

# We drop correlated features because with so many binary
# ngram variables there's a lot of noise
corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
df.drop([column for column in upper.columns if any(upper[column] > 0.5)], axis=1, inplace=True)

# Now we separate in train/test sets
X = df.loc[:, df.columns != 'target'].select_dtypes(include=[np.number]).values
y = df['target'].values
div = int(round(len(X) * 0.2))
X_train = X[:div]
y_train = y[:div]
X_test = X[div:]
y_test = y[div:]

# Finally, we perform ML and see results
rf = RandomForestRegressor(n_estimators = 1000, random_state = 0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
df_res = pd.DataFrame({'y_test' : y_test, 'y_pred' : y_pred})
threshold = 0.5
preds = [1 if val > threshold else 0 for val in df_res['y_pred']]
print(confusion_matrix(preds, df_res['y_test']))
print('Accuracy Score:')
print(accuracy_score(preds, df_res['y_test']))
print('Precision Score:')
print(precision_score(preds, df_res['y_test']))

While the overall accuracy isn’t impressive, we are particularly interested in the Precision Score, as our goal is to detect which of the following days will most likely have an uprise (and avoid downfalls).

Here is how to read the above confusion matrix:

  • Our model predicted ‘uprise’ 101 times; of those, 84 were actual uprises and 17 were not (a 0.83 Precision Score).
  • In total, there were 157 uprises. Our model detected 84 of them and missed 73.
  • In total, there were 32 ‘downfall’ (or simply not ‘uprise’) cases; our model detected 15 of them and missed 17 (a quick arithmetic check follows below).
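To make that reading concrete, here is the arithmetic using the counts reported for this run:

# Counts taken from the confusion matrix reading above
tp, fp = 84, 17  # predicted uprise: correct vs. incorrect
fn, tn = 73, 15  # predicted no-uprise: missed uprises vs. correctly avoided downfalls
precision = tp / (tp + fp)                   # ~0.83
recall = tp / (tp + fn)                      # ~0.54 of all uprises detected
accuracy = (tp + tn) / (tp + fp + fn + tn)   # ~0.52
print(round(precision, 2), round(recall, 2), round(accuracy, 2))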

In other words, if the priority is to avoid downfalls (even if it means sacrificing a lot of ‘uprise’ cases), this model worked well in this period of time.

We can also say that if the priority were to avoid missing uprises (even if a few downfalls slipped through), this model with this threshold would not be a good option at all.
