Clinical Trial Outcome Prediction

Part 1: Multi-modal health data embedding

Lennart Langouche
Towards Data Science


I recently came across this article: HINT: Hierarchical interaction network for clinical-trial-outcome predictions by Fu et al. It’s an interesting application of real-world data science, and it inspired me to create my own project in which I attempt to predict clinical trial outcomes based on publicly available information from ClinicalTrials.gov.

The aim of the project is to predict the outcome of clinical trials (a binary outcome: fail vs. success), without actually having to run them. We will use publicly available clinical trial information from ClinicalTrials.gov such as Drug Molecule, Disease Indication, Trial Protocol, Sponsor, and Number of Participants and embed it (transform it into vector representations) using different tools such as BioBERT, SBERT, and DeepPurpose.

Workflow schematic (image by author)

In the first part of this series I focus on embedding the multi-modal clinical trial data. In the second part I use an XGBoost model to predict trial outcomes (a binary prediction: fail vs. success) and briefly compare my simple XGBoost model’s performance to the HINT model’s performance from the article that inspired this project.

Focus for Part 1 of this series: Embedding multi-modal clinical trial data into vectors (image by author)

Here are the steps I will follow in this article:

1. Collect clinical trial records from ClinicalTrials.gov
2. Read and parse the obtained XML files
3. Embed disease indications and inclusion/exclusion criteria using tiny-biobert
4. Embed sponsor information using all-MiniLM-L6-v2
5. Convert drug names to SMILES strings and Morgan fingerprints using DeepPurpose

You can follow all the steps in this Jupyter notebook: clinical trial embedding tutorial.

Collect clinical trial records from ClinicalTrials.gov

I suggest running the whole process from the command line, since it is both time- and space-intensive. In case you don’t have wget installed on your system, have a look here for how to install wget. Open a command line/terminal and type in the following commands:

# 0. Clone repository
# Navigate to the directory where you want to clone the repository and type:
git clone https://github.com/lenlan/clinical-trial-prediction.git
cd clinical-trial-prediction

# 1. Download data
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip # This will take 10-20 minutes to download

# 2. Unzip the ZIP file.
# The unzipped file occupies approximately 11 GB. Please make sure you have enough space.
unzip AllPublicXML.zip # This might take over an hour to run, depending on your system
cd ../

# 3. Collect and sort all the XML file paths and write the output to data/all_xml
mkdir -p data
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
head -3 data/all_xml

### Output:
# raw_data/NCT0000xxxx/NCT00000102.xml
# raw_data/NCT0000xxxx/NCT00000104.xml
# raw_data/NCT0000xxxx/NCT00000105.xml

# NCTID is the identifier of a clinical trial. `NCT00000102`, `NCT00000104`, `NCT00000105` are all NCTIDs.

# 4. Remove ZIP file to recover some disk space
rm raw_data/AllPublicXML.zip

Read and parse the obtained XML files

Now that you have the clinical trials as individual XML files on your hard drive, we’re going to extract the information we need by parsing them.

import pandas as pd
from xml.etree import ElementTree as ET

# function adapted from https://github.com/futianfan/clinical-trial-outcome-prediction
def xmlfile2results(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    nctid = root.find('id_info').find('nct_id').text  ### nctid: 'NCT00000102'
    print("nctid is", nctid)
    study_type = root.find('study_type').text
    print("study type is", study_type)
    interventions = root.findall('intervention')
    drug_interventions = [i.find('intervention_name').text for i in interventions
                          if i.find('intervention_type').text == 'Drug']
    print("drug intervention:", drug_interventions)
    ### skip trials without drug interventions (e.g. biologics, non-interventional studies)
    if len(drug_interventions) == 0:
        return (None,)

    try:
        status = root.find('overall_status').text
        print("status:", status)
    except AttributeError:
        status = ''

    try:
        why_stop = root.find('why_stopped').text
        print("why stop:", why_stop)
    except AttributeError:
        why_stop = ''

    try:
        phase = root.find('phase').text
        print("phase:", phase)
    except AttributeError:
        phase = ''

    conditions = [i.text for i in root.findall('condition')]  ### disease indications
    print("disease", conditions)

    try:
        criteria = root.find('eligibility').find('criteria').find('textblock').text
        print('found criteria')
    except AttributeError:
        criteria = ''

    try:
        enrollment = root.find('enrollment').text
        print("enrollment:", enrollment)
    except AttributeError:
        enrollment = ''

    try:
        lead_sponsor = root.find('sponsors').find('lead_sponsor').find('agency').text
        print("lead_sponsor:", lead_sponsor)
    except AttributeError:
        lead_sponsor = ''

    data = {'nctid': nctid,
            'study_type': study_type,
            'drug_interventions': [drug_interventions],
            'overall_status': status,
            'why_stopped': why_stop,
            'phase': phase,
            'indications': [conditions],
            'criteria': criteria,
            'enrollment': enrollment,
            'lead_sponsor': lead_sponsor}
    return pd.DataFrame(data)

### Output:
# nctid is NCT00040014
# study type is Interventional
# drug intervention: ['exemestane']
# status: Terminated
# phase: Phase 2
# disease ['Breast Neoplasms']
# found criteria
# enrollment: 100
# lead_sponsor: Pfizer
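
The embedding steps below load a toy dataset from data/toy_df.pkl, which is built in the accompanying notebook (the notebook also adds columns such as the SMILES strings used later). As a rough sketch of how the parsed records could be assembled, assuming the paths collected in data/all_xml above and an arbitrary sample size of 1,000 files:

import pandas as pd

# Parse a sample of trials and combine the results into one DataFrame.
# xmlfile2results returns (None,) for trials without drug interventions,
# so those are filtered out before concatenating.
with open('data/all_xml') as f:
    xml_paths = [line.strip() for line in f][:1000]

results = [xmlfile2results(path) for path in xml_paths]
toy_df = pd.concat([r for r in results if isinstance(r, pd.DataFrame)],
                   ignore_index=True)
toy_df.to_pickle('data/toy_df.pkl')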

Using sentence-transformers to embed information — Example

First we need to install the sentence-transformers library.

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
# all-MiniLM-L6-v2 encodes each sentence into a 384-dimensional vector
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings.shape)

### Output:
# (2, 384)

We have successfully transformed two sentences into 384-dimensional vector representations.
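
Note that the embedding dimension depends on the model. The following sections use nlpie/tiny-biobert, which produces 312-dimensional embeddings; as a quick check, loading it the same way:

from sentence_transformers import SentenceTransformer

# 'nlpie/tiny-biobert' is a compact version of BioBERT with a hidden size of 312
model = SentenceTransformer('nlpie/tiny-biobert')
embeddings = model.encode(["This is an example sentence"])
print(embeddings.shape)

### Output:
# (1, 312)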

Embed disease indications using tiny-biobert

We first create a dictionary that maps each indication to its 312-dimensional embedding using tiny-biobert. Then we create a second dictionary that maps each trial identifier directly to its disease embedding. When a trial includes multiple indications, we take the mean of their embeddings as its vector representation.

import pickle
from functools import reduce

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

def create_indication2embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Create list with all indications and encode each one into a 312-dimensional vector
    all_indications = sorted(set(reduce(lambda x, y: x + y, toy_df['indications'].tolist())))

    # Using 'nlpie/tiny-biobert', a smaller version of BioBERT
    model = SentenceTransformer('nlpie/tiny-biobert')
    embeddings = model.encode(all_indications, show_progress_bar=True)

    # Create dictionary mapping indications to embeddings
    indication2embedding_dict = {}
    for key, row in zip(all_indications, embeddings):
        indication2embedding_dict[key] = row
    pickle.dump(indication2embedding_dict, open('data/indication2embedding_dict.pkl', 'wb'))

    # Map each trial (NCTID) to the mean embedding of its indications
    embedding = []
    for indication_lst in tqdm(toy_df['indications'].tolist()):
        vec = []
        for indication in indication_lst:
            vec.append(indication2embedding_dict[indication])
        # print(np.array(vec).shape) # DEBUG
        vec = np.mean(np.array(vec), axis=0)
        # print(vec.shape) # DEBUG
        embedding.append(vec)
    print(np.array(embedding).shape)

    nctid2disease_embedding_dict = {}
    for key, row in zip(toy_df['nctid'], np.array(embedding)):
        nctid2disease_embedding_dict[key] = row
    pickle.dump(nctid2disease_embedding_dict, open('data/nctid2disease_embedding_dict.pkl', 'wb'))

create_indication2embedding_dict()
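
To sanity-check the result, we can load the pickled dictionary and look up a single trial (assuming NCT00040014, the trial parsed above, made it into your toy dataset):

import pickle

with open('data/nctid2disease_embedding_dict.pkl', 'rb') as f:
    nctid2disease_embedding_dict = pickle.load(f)

# Each value is the 312-dimensional mean embedding of a trial's indications
print(nctid2disease_embedding_dict['NCT00040014'].shape)

### Output:
# (312,)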

Embed clinical trial inclusion/exclusion criteria using tiny-biobert

In a very similar way, we encode the clinical trial inclusion/exclusion criteria. Some additional data cleaning is involved to get the text into the right format: a split_protocol helper (see the sketch below) splits the criteria text into inclusion and exclusion sections. We encode the two sections separately, and each criteria embedding is the mean of the embeddings of the sentences it consists of.
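
The function below relies on that split_protocol helper, which is defined in the accompanying notebook (adapted from the HINT repository). As a minimal sketch of the idea, assuming the criteria text uses the usual "Inclusion Criteria:"/"Exclusion Criteria:" headers, it could look like this:

import re

def split_protocol(protocol):
    # Minimal sketch: split the raw criteria text into an inclusion part and
    # an exclusion part, returning each as a list of non-empty lines. The
    # real helper in the notebook performs more thorough cleaning.
    sentences = [s.strip().lower() for s in protocol.split('\n') if s.strip()]
    for idx, sentence in enumerate(sentences):
        if re.match(r'^exclusion criteria', sentence):
            return (sentences[:idx], sentences[idx:])
    return (sentences,)  # no exclusion section found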

def create_nctid2protocol_embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Using 'nlpie/tiny-biobert', a smaller version of BioBERT
    model = SentenceTransformer('nlpie/tiny-biobert')

    def criteria2vec(criteria):
        embeddings = model.encode(criteria)
        # print(embeddings.shape) # DEBUG
        embeddings_avg = np.mean(embeddings, axis=0)
        # print(embeddings_avg.shape) # DEBUG
        return embeddings_avg

    nctid_2_protocol_embedding = dict()
    print(f"Embedding {len(toy_df)*2} inclusion/exclusion criteria..")
    for nctid, protocol in tqdm(zip(toy_df['nctid'].tolist(), toy_df['criteria'].tolist())):
        # split_protocol separates the text into inclusion/exclusion criteria
        split = split_protocol(protocol)
        if len(split) == 2:
            embedding = np.concatenate((criteria2vec(split[0]), criteria2vec(split[1])))
        else:  # no exclusion criteria found: pad with zeros
            embedding = np.concatenate((criteria2vec(split[0]), np.zeros(312)))
        nctid_2_protocol_embedding[nctid] = embedding
    pickle.dump(nctid_2_protocol_embedding, open('data/nctid_2_protocol_embedding_dict.pkl', 'wb'))

create_nctid2protocol_embedding_dict()

Embed sponsor information using all-MiniLM-L6-v2, a powerful pre-trained sentence encoder from SentenceBERT

I chose to encode trial sponsors with sentence embeddings (SBERT) as well. Simpler methods such as label or one-hot encoding could work too, but I wanted to be able to catch similarities between sponsor names in case there are typos or multiple spellings of the same sponsor. I use the pre-trained all-MiniLM-L6-v2 model, which achieves high speed and performance on benchmark datasets. It converts each sponsor institution into a 384-dimensional vector.
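
As a quick illustration of why sentence embeddings can help here, a toy check with hypothetical sponsor-name variants shows that different spellings of the same sponsor land close together in embedding space:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two spellings of the same sponsor, plus an unrelated institution
emb = model.encode(["Pfizer", "Pfizer Inc.", "University of Oxford"])

# Cosine similarity: name variants score much higher than unrelated names
print(util.cos_sim(emb[0], emb[1]))  # high similarity
print(util.cos_sim(emb[0], emb[2]))  # low similarity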

def create_sponsor2embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Create list with all sponsors and encode each one into a 384-dimensional vector
    all_sponsors = sorted(set(toy_df['lead_sponsor'].tolist()))

    # Using 'all-MiniLM-L6-v2', a pre-trained model with excellent performance and speed
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(all_sponsors, show_progress_bar=True)
    print(embeddings.shape)

    # Create dictionary mapping sponsors to embeddings
    sponsor2embedding_dict = {}
    for key, row in zip(all_sponsors, embeddings):
        sponsor2embedding_dict[key] = row
    pickle.dump(sponsor2embedding_dict, open('data/sponsor2embedding_dict.pkl', 'wb'))

create_sponsor2embedding_dict()

Convert drug names to their SMILES representation and then to their Morgan fingerprint using DeepPurpose

Molecules can be represented as SMILES strings, a line notation for encoding molecular structure. Drug names are extracted from ClinicalTrials.gov and linked to their molecular structures (SMILES strings) using the CACTUS Chemical Identifier Resolver.

import requests

def get_smiles(drug_name):
    # URL for the CACTUS Chemical Identifier Resolver (CIR) API
    base_url = "https://cactus.nci.nih.gov/chemical/structure"
    url = f"{base_url}/{drug_name}/smiles"

    try:
        # Send a GET request to retrieve the SMILES representation
        response = requests.get(url)

        if response.status_code == 200:
            smiles = response.text.strip()  # Get the SMILES string
            print(f"Drug Name: {drug_name}")
            print(f"SMILES: {smiles}")
        else:
            print(f"Failed to retrieve SMILES for {drug_name}. Status code: {response.status_code}")
            smiles = ''

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        smiles = ''  # avoid returning an undefined variable on request errors

    return smiles

# Define the drug name you want to convert
drug_name = "aspirin"  # Replace with the drug name of your choice
get_smiles(drug_name)

### Output:
# Drug Name: aspirin
# SMILES: CC(=O)Oc1ccccc1C(O)=O

DeepPurpose can be used to encode molecular compounds. It currently supports 15 different encodings. We will use Morgan encoding, which encodes the atom groups of a chemical into a binary vector, with length and radius as its two parameters. First we need to install the DeepPurpose library.

pip install DeepPurpose

Overview of DeepPurpose Encoders (Image by Huang et al., CC license)
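
DeepPurpose exposes this encoding through smiles2morgan in DeepPurpose.utils; to my knowledge its radius and nBits arguments correspond to the radius and length parameters mentioned above, with defaults of 2 and 1024. A quick example using the aspirin SMILES string retrieved earlier:

from DeepPurpose.utils import smiles2morgan

# Encode aspirin's SMILES string into a binary Morgan fingerprint
fingerprint = smiles2morgan('CC(=O)Oc1ccccc1C(O)=O')
print(fingerprint.shape)

### Output:
# (1024,)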

We create a dictionary that maps SMILES strings to their Morgan representations, and a second dictionary that maps clinical trial identifiers (NCTIDs) directly to their Morgan representations.

def create_smiles2morgan_dict():
    from DeepPurpose.utils import smiles2morgan

    # Import toy dataset
    toy_df = pd.read_csv('data/toy_df.csv')

    # txt_to_lst (defined in the accompanying notebook) parses the string
    # representation of a list back into a Python list
    smiles_lst = list(map(txt_to_lst, toy_df['smiless'].tolist()))
    unique_smiles = set(reduce(lambda x, y: x + y, smiles_lst))

    morgan = pd.Series(list(unique_smiles)).apply(smiles2morgan)
    smiles2morgan_dict = dict(zip(unique_smiles, morgan))
    pickle.dump(smiles2morgan_dict, open('data/smiles2morgan_dict.pkl', 'wb'))

def create_nctid2molecule_embedding_dict():
    # Import toy dataset
    toy_df = pd.read_csv('data/toy_df.csv')
    smiles_lst = list(map(txt_to_lst, toy_df['smiless'].tolist()))
    # load_smiles2morgan_dict (defined in the accompanying notebook) loads
    # the dictionary pickled by create_smiles2morgan_dict
    smiles2morgan_dict = load_smiles2morgan_dict()

    # Map each trial (NCTID) to the mean Morgan fingerprint of its drugs
    embedding = []
    for drugs in tqdm(smiles_lst):
        vec = []
        for drug in drugs:
            vec.append(smiles2morgan_dict[drug])
        # print(np.array(vec).shape) # DEBUG
        vec = np.mean(np.array(vec), axis=0)
        # print(vec.shape) # DEBUG
        embedding.append(vec)
    print(np.array(embedding).shape)

    nctid2molecule_embedding_dict = {}
    for key, row in zip(toy_df['nctid'], np.array(embedding)):
        nctid2molecule_embedding_dict[key] = row
    pickle.dump(nctid2molecule_embedding_dict, open('data/nctid2molecule_embedding_dict.pkl', 'wb'))

create_nctid2molecule_embedding_dict()
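
As a final sanity check, the pickled dictionaries can be combined into a single feature vector per trial, which is the input format Part 2 will work with. A minimal sketch, assuming an NCTID that is present in all three trial-level dictionaries (the sponsor embedding would be looked up separately via the trial's lead_sponsor):

import pickle
import numpy as np

def load_dict(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

disease = load_dict('data/nctid2disease_embedding_dict.pkl')
protocol = load_dict('data/nctid_2_protocol_embedding_dict.pkl')
molecule = load_dict('data/nctid2molecule_embedding_dict.pkl')

nctid = 'NCT00040014'  # hypothetical example; use any NCTID in your toy dataset
features = np.concatenate((disease[nctid], protocol[nctid], molecule[nctid]))
print(features.shape)  # 312 + 624 + 1024 -> (1960,)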

Conclusion

Using publicly available clinical trial information, we can create useful inputs for machine learning models through feature embedding. To summarize, we:

  • collected and parsed clinical trial records from ClinicalTrials.gov
  • embedded disease indications and inclusion/exclusion criteria using tiny-biobert
  • embedded sponsor information using all-MiniLM-L6-v2
  • converted drug names to SMILES strings and encoded them as Morgan fingerprints using DeepPurpose

In the second part of this series I will run a simple XGBoost model to predict the clinical trial outcome based on the embedded vector representations we created here. I will compare its performance to that of the HINT model.

References

  • Fu, Tianfan, et al. “HINT: Hierarchical interaction network for clinical-trial-outcome predictions.” Patterns 3.4 (2022).
  • Huang, Kexin, et al. “DeepPurpose: a deep learning library for drug–target interaction prediction.” Bioinformatics 36.22–23 (2020): 5545–5547.
