Clinical Trial Outcome Prediction

Part 1: Multi-modal health data embedding

Lennart Langouche
Towards Data Science


I recently came across this article: HINT: Hierarchical interaction network for clinical-trial-outcome predictions by Fu et al. It’s an interesting application of real-world data science, and it inspired me to create my own project in which I attempt to predict clinical trial outcomes based on publicly available information from ClinicalTrials.gov.

The aim of the project is to predict the outcome of clinical trials (a binary outcome: fail vs. success), without actually having to run them. We will use publicly available clinical trial information from ClinicalTrials.gov such as Drug Molecule, Disease Indication, Trial Protocol, Sponsor, and Number of Participants and embed it (transform it into vector representations) using different tools such as BioBERT, SBERT, and DeepPurpose.

Workflow schematic (image by author)

In the first part of this series I focus on embedding the multi-modal clinical trial data. In the second part I use an XGBoost model to predict trial outcomes (a binary prediction: fail vs. success) and briefly compare my simple XGBoost model’s performance to the HINT model’s performance from the article that inspired this project.

Focus for Part 1 of this series: Embedding multi-modal clinical trial data into vectors (image by author)

Here are the steps I will follow in this article:

1. Collect clinical trial records from ClinicalTrials.gov
2. Read and parse the obtained XML files
3. Embed disease indications and inclusion/exclusion criteria using tiny-biobert
4. Embed sponsor information using all-MiniLM-L6-v2
5. Convert drug names to SMILES strings and Morgan fingerprints using DeepPurpose

You can follow all the steps in this Jupyter notebook: clinical trial embedding tutorial.

Collect clinical trial records from ClinicalTrials.gov

I suggest running the whole process from the command line, since it is both time- and space-intensive. In case you don’t have wget installed on your system, have a look here for how to install wget. Open a command line/terminal and type in the following commands:

# 0. Clone repository
# Navigate to the directory where you want to clone the repository and type:
git clone https://github.com/lenlan/clinical-trial-prediction.git
cd clinical-trial-prediction

# 1. Download data
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip # This will take 10-20 minutes to download

# 2. Unzip the ZIP file.
# The unzipped file occupies approximately 11 GB. Please make sure you have enough space.
unzip AllPublicXML.zip # This might take over an hour to run, depending on your system
cd ../

# 3. Collect and sort all the XML file paths and write the output to data/all_xml
mkdir -p data
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
head -3 data/all_xml

### Output:
# raw_data/NCT0000xxxx/NCT00000102.xml
# raw_data/NCT0000xxxx/NCT00000104.xml
# raw_data/NCT0000xxxx/NCT00000105.xml

# NCTID is the identifier of a clinical trial. `NCT00000102`, `NCT00000104`, `NCT00000105` are all NCTIDs.

# 4. Remove ZIP file to recover some disk space
rm raw_data/AllPublicXML.zip

Read and parse the obtained XML files

Now that you have the clinical trials as individual XML files on your hard drive, we’re going to extract the information we need by parsing them.

import pandas as pd
from xml.etree import ElementTree as ET

# function adapted from https://github.com/futianfan/clinical-trial-outcome-prediction
def xmlfile2results(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    nctid = root.find('id_info').find('nct_id').text  ### nctid: 'NCT00000102'
    print("nctid is", nctid)
    study_type = root.find('study_type').text
    print("study type is", study_type)
    interventions = root.findall('intervention')
    drug_interventions = [i.find('intervention_name').text for i in interventions
                          if i.find('intervention_type').text == 'Drug']
    print("drug intervention:", drug_interventions)
    ### skip trials without drug interventions (e.g. biologics, non-interventional studies)
    if len(drug_interventions) == 0:
        return (None,)

    try:
        status = root.find('overall_status').text
        print("status:", status)
    except AttributeError:
        status = ''

    try:
        why_stop = root.find('why_stopped').text
        print("why stop:", why_stop)
    except AttributeError:
        why_stop = ''

    try:
        phase = root.find('phase').text
        print("phase:", phase)
    except AttributeError:
        phase = ''

    conditions = [i.text for i in root.findall('condition')]  ### disease indications
    print("disease", conditions)

    try:
        criteria = root.find('eligibility').find('criteria').find('textblock').text
        print('found criteria')
    except AttributeError:
        criteria = ''

    try:
        enrollment = root.find('enrollment').text
        print("enrollment:", enrollment)
    except AttributeError:
        enrollment = ''

    try:
        lead_sponsor = root.find('sponsors').find('lead_sponsor').find('agency').text
        print("lead_sponsor:", lead_sponsor)
    except AttributeError:
        lead_sponsor = ''

    data = {'nctid': nctid,
            'study_type': study_type,
            'drug_interventions': [drug_interventions],
            'overall_status': status,
            'why_stopped': why_stop,
            'phase': phase,
            'indications': [conditions],
            'criteria': criteria,
            'enrollment': enrollment,
            'lead_sponsor': lead_sponsor}
    return pd.DataFrame(data)

### Output:
# nctid is NCT00040014
# study type is Interventional
# drug intervention: ['exemestane']
# status: Terminated
# phase: Phase 2
# disease ['Breast Neoplasms']
# found criteria
# enrollment: 100
# lead_sponsor: Pfizer
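
The embedding steps below load a toy dataset from data/toy_df.pkl, which is built in the accompanying notebook (the notebook also adds columns such as the SMILES strings used later). As a rough sketch of how the parsed records could be assembled, assuming the paths collected in data/all_xml above and an arbitrary sample size of 1,000 files:

import pandas as pd

# Parse a sample of trials and combine the results into one DataFrame.
# xmlfile2results returns (None,) for trials without drug interventions,
# so those are filtered out before concatenating.
with open('data/all_xml') as f:
    xml_paths = [line.strip() for line in f][:1000]

results = [xmlfile2results(path) for path in xml_paths]
toy_df = pd.concat([r for r in results if isinstance(r, pd.DataFrame)],
                   ignore_index=True)
toy_df.to_pickle('data/toy_df.pkl')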

Using sentence-transformers to embed information — Example

First we need to install the sentence-transformers library.

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
# all-MiniLM-L6-v2 encodes each sentence into a 384-dimensional vector
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings.shape)

### Output:
# (2, 384)

We have successfully transformed two sentences into 384-dimensional vector representations.
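
Note that the embedding dimension depends on the model. The following sections use nlpie/tiny-biobert, which produces 312-dimensional embeddings; as a quick check, loading it the same way:

from sentence_transformers import SentenceTransformer

# 'nlpie/tiny-biobert' is a compact version of BioBERT with a hidden size of 312
model = SentenceTransformer('nlpie/tiny-biobert')
embeddings = model.encode(["This is an example sentence"])
print(embeddings.shape)

### Output:
# (1, 312)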

Embed disease indications using tiny-biobert

We first create a dictionary that maps each indication to its 312-dimensional embedding using tiny-biobert. Then we create a second dictionary that maps each trial identifier directly to its disease embedding. When a trial includes multiple indications, we take the mean of their embeddings as its vector representation.

import pickle
from functools import reduce

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

def create_indication2embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Create list with all indications and encode each one into a 312-dimensional vector
    all_indications = sorted(set(reduce(lambda x, y: x + y, toy_df['indications'].tolist())))

    # Using 'nlpie/tiny-biobert', a smaller version of BioBERT
    model = SentenceTransformer('nlpie/tiny-biobert')
    embeddings = model.encode(all_indications, show_progress_bar=True)

    # Create dictionary mapping indications to embeddings
    indication2embedding_dict = {}
    for key, row in zip(all_indications, embeddings):
        indication2embedding_dict[key] = row
    pickle.dump(indication2embedding_dict, open('data/indication2embedding_dict.pkl', 'wb'))

    # Map each trial (NCTID) to the mean embedding of its indications
    embedding = []
    for indication_lst in tqdm(toy_df['indications'].tolist()):
        vec = []
        for indication in indication_lst:
            vec.append(indication2embedding_dict[indication])
        # print(np.array(vec).shape) # DEBUG
        vec = np.mean(np.array(vec), axis=0)
        # print(vec.shape) # DEBUG
        embedding.append(vec)
    print(np.array(embedding).shape)

    nctid2disease_embedding_dict = {}
    for key, row in zip(toy_df['nctid'], np.array(embedding)):
        nctid2disease_embedding_dict[key] = row
    pickle.dump(nctid2disease_embedding_dict, open('data/nctid2disease_embedding_dict.pkl', 'wb'))

create_indication2embedding_dict()
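
To sanity-check the result, we can load the pickled dictionary and look up a single trial (assuming NCT00040014, the trial parsed above, made it into your toy dataset):

import pickle

with open('data/nctid2disease_embedding_dict.pkl', 'rb') as f:
    nctid2disease_embedding_dict = pickle.load(f)

# Each value is the 312-dimensional mean embedding of a trial's indications
print(nctid2disease_embedding_dict['NCT00040014'].shape)

### Output:
# (312,)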

Embed clinical trial inclusion/exclusion criteria using tiny-biobert

In a very similar way, we encode the clinical trial inclusion/exclusion criteria. Some additional data cleaning is involved to get the text into the right format: a split_protocol helper (see the sketch below) splits the criteria text into inclusion and exclusion sections. We encode the two sections separately, and each criteria embedding is the mean of the embeddings of the sentences it consists of.
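
The function below relies on that split_protocol helper, which is defined in the accompanying notebook (adapted from the HINT repository). As a minimal sketch of the idea, assuming the criteria text uses the usual "Inclusion Criteria:"/"Exclusion Criteria:" headers, it could look like this:

import re

def split_protocol(protocol):
    # Minimal sketch: split the raw criteria text into an inclusion part and
    # an exclusion part, returning each as a list of non-empty lines. The
    # real helper in the notebook performs more thorough cleaning.
    sentences = [s.strip().lower() for s in protocol.split('\n') if s.strip()]
    for idx, sentence in enumerate(sentences):
        if re.match(r'^exclusion criteria', sentence):
            return (sentences[:idx], sentences[idx:])
    return (sentences,)  # no exclusion section found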

def create_nctid2protocol_embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Using 'nlpie/tiny-biobert', a smaller version of BioBERT
    model = SentenceTransformer('nlpie/tiny-biobert')

    def criteria2vec(criteria):
        embeddings = model.encode(criteria)
        # print(embeddings.shape) # DEBUG
        embeddings_avg = np.mean(embeddings, axis=0)
        # print(embeddings_avg.shape) # DEBUG
        return embeddings_avg

    nctid_2_protocol_embedding = dict()
    print(f"Embedding {len(toy_df)*2} inclusion/exclusion criteria..")
    for nctid, protocol in tqdm(zip(toy_df['nctid'].tolist(), toy_df['criteria'].tolist())):
        # split_protocol separates the text into inclusion/exclusion criteria
        split = split_protocol(protocol)
        if len(split) == 2:
            embedding = np.concatenate((criteria2vec(split[0]), criteria2vec(split[1])))
        else:  # no exclusion criteria found: pad with zeros
            embedding = np.concatenate((criteria2vec(split[0]), np.zeros(312)))
        nctid_2_protocol_embedding[nctid] = embedding
    pickle.dump(nctid_2_protocol_embedding, open('data/nctid_2_protocol_embedding_dict.pkl', 'wb'))

create_nctid2protocol_embedding_dict()

Embed sponsor information using all-MiniLM-L6-v2, a powerful pre-trained sentence encoder from SentenceBERT

I chose to encode trial sponsors with sentence embeddings (SBERT) as well. Simpler methods such as label or one-hot encoding could work too, but I wanted to be able to catch similarities between sponsor names in case there are typos or multiple spellings of the same sponsor. I use the pre-trained all-MiniLM-L6-v2 model, which achieves high speed and performance on benchmark datasets. It converts each sponsor institution into a 384-dimensional vector.
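
As a quick illustration of why sentence embeddings can help here, a toy check with hypothetical sponsor-name variants shows that different spellings of the same sponsor land close together in embedding space:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two spellings of the same sponsor, plus an unrelated institution
emb = model.encode(["Pfizer", "Pfizer Inc.", "University of Oxford"])

# Cosine similarity: name variants score much higher than unrelated names
print(util.cos_sim(emb[0], emb[1]))  # high similarity
print(util.cos_sim(emb[0], emb[2]))  # low similarity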

def create_sponsor2embedding_dict():
    # Import toy dataset
    toy_df = pd.read_pickle('data/toy_df.pkl')

    # Create list with all sponsors and encode each one into a 384-dimensional vector
    all_sponsors = sorted(set(toy_df['lead_sponsor'].tolist()))

    # Using 'all-MiniLM-L6-v2', a pre-trained model with excellent performance and speed
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(all_sponsors, show_progress_bar=True)
    print(embeddings.shape)

    # Create dictionary mapping sponsors to embeddings
    sponsor2embedding_dict = {}
    for key, row in zip(all_sponsors, embeddings):
        sponsor2embedding_dict[key] = row
    pickle.dump(sponsor2embedding_dict, open('data/sponsor2embedding_dict.pkl', 'wb'))

create_sponsor2embedding_dict()

Convert drug names to their SMILES representation and then to their Morgan fingerprint using DeepPurpose

Molecules can be represented as SMILES strings, a line notation for encoding molecular structure. Drug names are extracted from ClinicalTrials.gov and linked to their molecular structures (SMILES strings) using the CACTUS Chemical Identifier Resolver.

import requests

def get_smiles(drug_name):
    # URL for the CACTUS Chemical Identifier Resolver (CIR) API
    base_url = "https://cactus.nci.nih.gov/chemical/structure"
    url = f"{base_url}/{drug_name}/smiles"

    try:
        # Send a GET request to retrieve the SMILES representation
        response = requests.get(url)

        if response.status_code == 200:
            smiles = response.text.strip()  # Get the SMILES string
            print(f"Drug Name: {drug_name}")
            print(f"SMILES: {smiles}")
        else:
            print(f"Failed to retrieve SMILES for {drug_name}. Status code: {response.status_code}")
            smiles = ''

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        smiles = ''  # avoid returning an undefined variable on request errors

    return smiles

# Define the drug name you want to convert
drug_name = "aspirin"  # Replace with the drug name of your choice
get_smiles(drug_name)

### Output:
# Drug Name: aspirin
# SMILES: CC(=O)Oc1ccccc1C(O)=O

DeepPurpose can be used to encode molecular compounds. It currently supports 15 different encodings. We will use Morgan encoding, which encodes the atom groups of a chemical into a binary vector, with length and radius as its two parameters. First we need to install the DeepPurpose library.

pip install DeepPurpose

Overview of DeepPurpose Encoders (Image by Huang et al., CC license)
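
DeepPurpose exposes this encoding through smiles2morgan in DeepPurpose.utils; to my knowledge its radius and nBits arguments correspond to the radius and length parameters mentioned above, with defaults of 2 and 1024. A quick example using the aspirin SMILES string retrieved earlier:

from DeepPurpose.utils import smiles2morgan

# Encode aspirin's SMILES string into a binary Morgan fingerprint
fingerprint = smiles2morgan('CC(=O)Oc1ccccc1C(O)=O')
print(fingerprint.shape)

### Output:
# (1024,)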

We create a dictionary that maps SMILES strings to their Morgan representations, and a second dictionary that maps clinical trial identifiers (NCTIDs) directly to their Morgan representations.

def create_smiles2morgan_dict():
    from DeepPurpose.utils import smiles2morgan

    # Import toy dataset
    toy_df = pd.read_csv('data/toy_df.csv')

    # txt_to_lst (defined in the accompanying notebook) parses the string
    # representation of a list back into a Python list
    smiles_lst = list(map(txt_to_lst, toy_df['smiless'].tolist()))
    unique_smiles = set(reduce(lambda x, y: x + y, smiles_lst))

    morgan = pd.Series(list(unique_smiles)).apply(smiles2morgan)
    smiles2morgan_dict = dict(zip(unique_smiles, morgan))
    pickle.dump(smiles2morgan_dict, open('data/smiles2morgan_dict.pkl', 'wb'))

def create_nctid2molecule_embedding_dict():
    # Import toy dataset
    toy_df = pd.read_csv('data/toy_df.csv')
    smiles_lst = list(map(txt_to_lst, toy_df['smiless'].tolist()))
    # load_smiles2morgan_dict (defined in the accompanying notebook) loads
    # the dictionary pickled by create_smiles2morgan_dict
    smiles2morgan_dict = load_smiles2morgan_dict()

    # Map each trial (NCTID) to the mean Morgan fingerprint of its drugs
    embedding = []
    for drugs in tqdm(smiles_lst):
        vec = []
        for drug in drugs:
            vec.append(smiles2morgan_dict[drug])
        # print(np.array(vec).shape) # DEBUG
        vec = np.mean(np.array(vec), axis=0)
        # print(vec.shape) # DEBUG
        embedding.append(vec)
    print(np.array(embedding).shape)

    nctid2molecule_embedding_dict = {}
    for key, row in zip(toy_df['nctid'], np.array(embedding)):
        nctid2molecule_embedding_dict[key] = row
    pickle.dump(nctid2molecule_embedding_dict, open('data/nctid2molecule_embedding_dict.pkl', 'wb'))

create_nctid2molecule_embedding_dict()
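
As a final sanity check, the pickled dictionaries can be combined into a single feature vector per trial, which is the input format Part 2 will work with. A minimal sketch, assuming an NCTID that is present in all three trial-level dictionaries (the sponsor embedding would be looked up separately via the trial's lead_sponsor):

import pickle
import numpy as np

def load_dict(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

disease = load_dict('data/nctid2disease_embedding_dict.pkl')
protocol = load_dict('data/nctid_2_protocol_embedding_dict.pkl')
molecule = load_dict('data/nctid2molecule_embedding_dict.pkl')

nctid = 'NCT00040014'  # hypothetical example; use any NCTID in your toy dataset
features = np.concatenate((disease[nctid], protocol[nctid], molecule[nctid]))
print(features.shape)  # 312 + 624 + 1024 -> (1960,)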

Conclusion

Using publicly available clinical trial information, we can create useful inputs for machine learning models through feature embedding. To summarize, we:

  • collected and parsed clinical trial records from ClinicalTrials.gov
  • embedded disease indications and inclusion/exclusion criteria using tiny-biobert
  • embedded sponsor information using all-MiniLM-L6-v2
  • converted drug names to SMILES strings and encoded them as Morgan fingerprints using DeepPurpose

In the second part of this series I will run a simple XGBoost model to predict the clinical trial outcome based on the embedded vector representations we created here. I will compare its performance to that of the HINT model.

References

  • Fu, Tianfan, et al. “HINT: Hierarchical interaction network for clinical-trial-outcome predictions.” Patterns 3.4 (2022).
  • Huang, Kexin, et al. “DeepPurpose: a deep learning library for drug–target interaction prediction.” Bioinformatics 36.22–23 (2020): 5545–5547.
