
CheMBL is a drug discovery Database of bioactive molecules. It brings together chemical, bioactivity, and genomics information with the goal of aiding the translation of genomic information into effective new drugs. In this post, we will perform exploratory data analysis on a sample of data from the CheMBL database.
To access the data, you need to create a Google Cloud account, install Google BigQuery and generate authentication credentials. The instructions for doing so can be found here.
Let’s get started!
First, let’s import the necessary google authentication and BigQuery packages:
from google.oauth2 import service_account
from google.cloud import bigquery
Next, we need to import pandas and define the column settings to be ‘None’. We do this so that we can display all columns since pandas truncate them by default.
import pandas as pd
pd.set_option("display.max_columns", None)
Next, we define our key path, client credentials, and client object:
key_path = "key_path/key.json"
credentials = service_account.Credentials.from_service_account_file(
key_path,
scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(
credentials=credentials,
project=credentials.project_id,
)
Next, we define the dataset reference where we pass in the name of the database, ‘ebi_chembl’, and the name of the project, ‘patents-public-data’:
dataset_ref = client.dataset("ebi_chembl", project="patents-public-data")
We can look at the tables available. There are 229 tables but we will look at the names of the first 10:
columns = [x.table_id for x in client.list_tables(dataset_ref)][:10]
print(columns)

Let’s look at the first element in the list, ‘action_type’:
table_ref = dataset_ref.table('action_type')
table = client.get_table(table_ref)
We can store the results in a dataframe and print the first five rows:
df = client.list_rows(table).to_dataframe()
print(df.head())

We can import the counter method from the collections module in Python to analyze the distribution in action types:
from collections import Counter
print(Counter(df['action_type'].values))

Let’s look at the next ten tables to see if there is anything with a bit more detail we can pull:
columns = [x.table_id for x in client.list_tables(dataset_ref)][10:20]
print(columns)

Let’s take a look at the ‘assay_classification’ table:
table_ref = dataset_ref.table('assay_classification')
table = client.get_table(table_ref)
df = client.list_rows(table).to_dataframe()
print(df.head())

Let’s look at the most common 3 values for ‘l1’ using counter:
from collections import Counter
print(Counter(df['l1'].values).most_common(3))

And for ‘l2’:
from collections import Counter
print(Counter(df['l2'].values).most_common(3))

We can also visualize some of this output. Let’s define a function that takes a category string as input and displays a bar chart of the three most common values:
def plot_most_common(category):
bar_plot = dict(Counter(df[category].values).most_common(3))
plt.bar(*zip(*bar_plot.items()))
plt.show()
Let’s now call the function with ‘l1’:
plot_most_common('l1')

And ‘l2’:
plot_most_common('l2')

In this post, we discussed how to access the European Bioinformatics Institute CheMBL database using Google BigQuery. The database contains chemical, bioactivity, and genomic information which can aid small molecule drug discovery. We performed some basic analysis of the action_type and assay_classification tables and generated some visualizations. The code from this post is available on GitHub. Thank you for reading!