
Analyzing the Centers for Disease Control and Prevention (CDC) Cancer Data using Pandas, Part 1


Photo by Miguel Á. Padriñán on Pexels

While research in cancer drug development and treatment has progressed over the years, cancer continues to claim the lives of thousands of people every year. Despite this, the potential for progress in cancer research continues to grow with increasing access to data, compute power, and state-of-the-art machine learning tools.

In this post we will explore the Centers for Disease Control and Prevention (CDC) cancer dataset. The dataset includes information on brain tumors and on cancer types by state, race, age, and much more. Here we focus on the ‘BRAINBYSITE.TXT’ data.

We begin by importing the Pandas library and reading the ‘.TXT’ file into a pandas data frame. Each column is separated by ‘|’, so we also set the separator parameter ‘sep’ accordingly. Let’s also specify which columns we are interested in analyzing and display the first five rows of data to get a feel for the column types and values:

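import pandas as pd
df = pd.read_csv("BRAINBYSITE.TXT", sep="|")
# keep only the columns used in this analysis ('CRUDE_RATE' is needed for the scatter plots later on)
df = df[['AGE', 'BEHAVIOR', 'COUNT', 'CRUDE_RATE', 'POPULATION', 'SEX', 'YEAR', 'SITE']]
print(df.head())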

The ‘COUNT’ column contains some missing values, marked with ‘~’, which we can remove:

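# missing counts are stored as the string '~'; drop those rows
df = df[df['COUNT'] != '~']
# drop=True avoids keeping the old index as an extra column
df.reset_index(drop=True, inplace=True)
print(df.head())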
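
As an alternative, here is a minimal sketch that converts the ‘~’ markers to NaN at read time with the `na_values` parameter of `read_csv`, so that ‘COUNT’ is parsed as a numeric column from the start. The small in-memory sample is made up for illustration and only stands in for the real file; its values are not CDC data.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the pipe-separated layout; not real CDC values.
sample = io.StringIO(
    "AGE|COUNT|SEX|YEAR\n"
    "0-19|120|Female|2004\n"
    "0-19|~|Male|2004\n"
    "20+|340|Male|2004\n"
)

# Treat '~' as missing so COUNT is parsed as numbers instead of strings.
sample_df = pd.read_csv(sample, sep="|", na_values=["~"])

# Drop the rows with a missing count and rebuild a clean 0..n-1 index.
sample_df = sample_df.dropna(subset=["COUNT"]).reset_index(drop=True)

# With the NaN rows gone, the column can be stored as integers.
sample_df["COUNT"] = sample_df["COUNT"].astype(int)
print(sample_df)
```

With this approach no later string-to-integer conversion of ‘COUNT’ is needed, at the cost of treating every ‘~’ in every column as missing.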

To start our analysis we can generate a histogram of the ‘COUNT’ column to visualize the distribution of tumor counts across all categories:

import seaborn as sns
import matplotlib.pyplot as plt
# convert the counts from strings to integers before plotting
df['COUNT'] = df['COUNT'].astype(int)
# settings for the histogram plot
sns.set(font_scale=2)
df['COUNT'].hist(bins=1000)
# zoom in on the bulk of the distribution
plt.ylim(0, 80)
plt.xlim(0, 10000)
Distribution in Tumor Counts

We can also look at the histogram for the number of tumors for females and males on an overlaid plot:

#define female and male data frames
df_female = df[df['SEX'] == 'Female']
df_male = df[df['SEX'] == 'Male']
#overlay histograms
df_female['COUNT'].hist(bins=1000)
df_male['COUNT'].hist(bins=1000).set_title('Male and Female')
Overlay of distributions in tumor counts for males and females
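
When two histograms share the same axes, one can hide the other. The sketch below shows one possible way to keep both visible, using transparency and a legend. It runs on a made-up toy frame rather than the CDC data, and the `Agg` backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows for illustration only; not real CDC values.
toy = pd.DataFrame({
    "SEX": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "COUNT": [5, 8, 12, 20, 7, 15],
})

fig, ax = plt.subplots()
for sex, group in toy.groupby("SEX"):
    # alpha < 1 keeps both distributions visible where the bars overlap
    ax.hist(group["COUNT"], bins=5, alpha=0.5, label=sex)

ax.set_title("Male and Female")
ax.set_xlabel("COUNT")
ax.legend()
```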

We can perform a more granular analysis and look at the distribution of tumor counts for a given reporting period, let’s say ‘YEAR’ = ‘2012-2016’, for females and males:

df_female = df[df['SEX'] == 'Female']
df_female = df_female[df_female['YEAR'] == '2012-2016']
# 'COUNT' was already converted to integers above
df_female['COUNT'].hist(bins=1000).set_title('Female')
Distribution in tumor counts for females between 2012–2016

df_male = df[df['SEX'] == 'Male']
df_male = df_male[df_male['YEAR'] == '2012-2016']
# start a new figure so the two histograms are not drawn on top of each other
plt.figure()
df_male['COUNT'].hist(bins=1000).set_title('Male')
Distribution in tumor counts for males between 2012–2016
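
The same female/male comparison can also be summarized numerically. The following is a minimal sketch using `groupby` on a made-up toy frame (the values are hypothetical, not CDC data); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real CDC values.
toy = pd.DataFrame({
    "SEX": ["Female", "Female", "Male", "Male", "Male"],
    "YEAR": ["2012-2016", "2004", "2012-2016", "2012-2016", "2004"],
    "COUNT": [10, 20, 30, 50, 70],
})

# Keep one reporting period, then summarize COUNT separately for each sex.
period = toy[toy["YEAR"] == "2012-2016"]
summary = period.groupby("SEX")["COUNT"].agg(["count", "mean", "max"])
print(summary)
```

A single grouped table like this avoids building one filtered data frame per sex when only summary numbers are needed.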

Next we can generate a scatter plot of ‘CRUDE_RATE’ vs. ‘YEAR’. Here we also scale the size and hue of the points in proportion to ‘COUNT’. We filter the data frame to keep the rows with ‘SITE’ = ‘Anaplastic astrocytoma’, ‘BEHAVIOR’ = ‘Malignant’, and ‘CRUDE_RATE’ above 0.1. For females we have:

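df_female = df[df['SEX'] == 'Female'].copy()
# make sure the rate is numeric; entries that cannot be parsed become NaN and are filtered out below
df_female['CRUDE_RATE'] = pd.to_numeric(df_female['CRUDE_RATE'], errors='coerce')
df_female = df_female[df_female['SITE'] == 'Anaplastic astrocytoma']
df_female = df_female[df_female['BEHAVIOR'] == 'Malignant']
# remove baseline value present in each year
df_female = df_female[df_female['CRUDE_RATE'] > 0.1]
# newer seaborn versions expect x and y as keyword arguments
sns.scatterplot(data=df_female, x="YEAR", y="CRUDE_RATE", size="COUNT", hue="COUNT", sizes=(1000, 1500), alpha=0.8)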
Scatter plot of ‘CRUDE_RATE’ vs ‘YEAR’ of Anaplastic astrocytoma in females

We can also look at this plot for males:

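df_male = df[df['SEX'] == 'Male'].copy()
# make sure the rate is numeric; entries that cannot be parsed become NaN and are filtered out below
df_male['CRUDE_RATE'] = pd.to_numeric(df_male['CRUDE_RATE'], errors='coerce')
df_male = df_male[df_male['SITE'] == 'Anaplastic astrocytoma']
df_male = df_male[df_male['BEHAVIOR'] == 'Malignant']
# remove baseline value present in each year
df_male = df_male[df_male['CRUDE_RATE'] > 0.1]
# newer seaborn versions expect x and y as keyword arguments
sns.scatterplot(data=df_male, x="YEAR", y="CRUDE_RATE", size="COUNT", hue="COUNT", sizes=(1000, 1500), alpha=0.8)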
Scatter plot of ‘CRUDE_RATE’ vs ‘YEAR’ of Anaplastic astrocytoma in males

As you can see, the crude rates of Anaplastic astrocytoma increased between 2004 and 2016 for both females and males. It is also worth noting that the rates for males are higher than for females for this type of brain tumor. Try generating this scatter plot for other tumor types and see if there are any interesting differences between females and males.
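
To make it easier to repeat the plot for other tumor types, the filtering steps can be wrapped in a small helper. This is only a sketch: `select_site` is a hypothetical name introduced here, the toy rows are made up, and ‘Glioblastoma’ is used as an assumed example of another ‘SITE’ value.

```python
import pandas as pd

def select_site(frame, sex, site, behavior="Malignant", min_rate=0.1):
    """Return the rows for one sex/site/behavior with CRUDE_RATE above min_rate."""
    # Coerce the rate to numbers; entries such as '~' become NaN and never pass the filter.
    rate = pd.to_numeric(frame["CRUDE_RATE"], errors="coerce")
    mask = (
        (frame["SEX"] == sex)
        & (frame["SITE"] == site)
        & (frame["BEHAVIOR"] == behavior)
        & (rate > min_rate)
    )
    return frame[mask]

# Hypothetical rows for illustration only; not real CDC values.
toy = pd.DataFrame({
    "SEX": ["Female", "Male", "Male", "Male"],
    "SITE": ["Anaplastic astrocytoma", "Anaplastic astrocytoma",
             "Anaplastic astrocytoma", "Glioblastoma"],
    "BEHAVIOR": ["Malignant"] * 4,
    "CRUDE_RATE": ["0.3", "0.4", "~", "3.2"],
    "YEAR": ["2004", "2004", "2005", "2004"],
})

males = select_site(toy, "Male", "Anaplastic astrocytoma")
print(males)
```

The returned frame can then be passed to `sns.scatterplot` through its `data` argument, changing only the `sex` and `site` arguments between plots.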

The next thing we can do is generate some statistics from some of these data columns. We can define a function that outputs the unique set of values for our categorical variables and counts the number of times that value appears in the data:

from collections import Counter
def get_unqiue_values(feature):
    print("{} Unique Set: ".format(feature), set(df[feature]))
    print("{} Count: ".format(feature), dict(Counter(df[feature])))
get_unqiue_values('SEX')

When we call the ‘get_unqiue_values’ function with the ‘SEX’ field, the output shows the unique values of the ‘SEX’ column and the number of times each value appears in the data.

Unique set of ‘SEX’ values and their frequencies

We can do the same for ‘BEHAVIOR’:

get_unqiue_values('BEHAVIOR')

The output shows the unique values of the ‘BEHAVIOR’ column and the number of times each value appears in the data.

Unique set of ‘BEHAVIOR’ values and their frequencies

And for the ‘SITE’ category:

get_unqiue_values('SITE')

The output shows the unique values of the ‘SITE’ column and the number of times each value appears in the data.

Unique set of ‘SITE’ values and their frequencies

There are comparatively more unique ‘SITE’ values than values of the other categorical variables in this data set. We can narrow the scope by looking at the five most common (most frequently occurring) values:

from collections import Counter
def get_unqiue_values(feature):
    print("{} Unique Set: ".format(feature), set(df[feature]))
    print("{} Count: ".format(feature), dict(Counter(df[feature]).most_common(5)))
get_unqiue_values('SITE')
Most common ‘SITE’ values
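
For comparison, pandas offers a built-in equivalent of `Counter(...).most_common(n)`: `Series.value_counts()` returns the frequencies sorted in descending order. The sketch below checks on made-up labels that both routes give the same top entries; the labels are placeholders, not real ‘SITE’ values.

```python
from collections import Counter
import pandas as pd

# Hypothetical category labels for illustration only.
sites = pd.Series(["A", "B", "A", "C", "A", "B"], name="SITE")

# Route used in this post: Counter plus most_common.
top_counter = dict(Counter(sites).most_common(2))

# pandas route: value_counts is already sorted from most to least frequent.
top_pandas = sites.value_counts().head(2).to_dict()

print(top_counter)
print(top_pandas)
```

When several values share the same frequency, the two approaches may order the tied entries differently.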

Visualizing some of this data can be a bit more useful than staring at dictionaries. To do this we can update our function ‘get_unqiue_values’ such that it returns a dictionary of the 10 most common brain tumors:

def get_unqiue_values(feature):
    print("{} Unique Set: ".format(feature), set(df[feature]))
    print("{} Count: ".format(feature), dict(Counter(df[feature]).most_common(5)))
    result = dict(Counter(df[feature]).most_common(10))
    return result

Next we iterate over the ‘SITE’ frequency dictionary and store the keys and values in a data frame:

key_list = []
value_list = []
for key, value in get_unqiue_values('SITE').items():
    key_list.append(key)
    value_list.append(value)
site_df = pd.DataFrame({'SITE': key_list, 'Count':value_list} )
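
As a more compact sketch of the same step, the frequency table can be built directly as a data frame, without the intermediate dictionary and lists. The toy frame below is made up for illustration; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical category labels for illustration only.
toy = pd.DataFrame({"SITE": ["A", "B", "A", "C", "A", "B"]})

site_df = (
    toy["SITE"]
    .value_counts()             # frequencies, most common first
    .head(10)                   # keep at most the 10 most common values
    .rename_axis("SITE")        # name the index so it becomes the 'SITE' column
    .reset_index(name="Count")  # move the index into a column, name the counts 'Count'
)
print(site_df)
```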

Finally we define a barplot object using seaborn and set the x-axis labels:

ax = sns.barplot(x=site_df.SITE, y=site_df.Count)
ax.set_xticklabels(site_df.SITE, rotation=30)
The frequency of tumor types

There is much more to explore in the ‘BRAINBYSITE.TXT’ data, but for now we will conclude our analysis. In the next few posts I will continue to explore some of the other data sets provided in the Centers for Disease Control and Prevention cancer dataset. Until then, feel free to repeat this analysis on the other data sets yourself. The code shown in this post can be found on GitHub. Good luck and happy machine learning!

