
While research in cancer drug development and treatment has progressed over the years, cancer continues to claim the lives of thousands of people every year. Despite this, the potential for progress in cancer research continues to grow with increasing access to data, compute power, and state-of-the-art machine learning tools.
In this post we will explore the cancer dataset published by the Centers for Disease Control and Prevention (CDC). The dataset includes information on brain tumors, cancer types by state, race, age, and much more. Here we will focus on the ‘BRAINBYSITE.TXT’ data.
We begin by importing the Pandas library and reading the ‘.TXT’ file into a pandas data frame. Each column is separated by ‘|’, so we also set the separator parameter ‘sep’ accordingly. Let's also specify which columns we are interested in analyzing and display the first five rows of data to get a feel for the column types and values:
import pandas as pd
df = pd.read_csv("BRAINBYSITE.TXT", sep="|")
# keep 'CRUDE_RATE' as well, since it is used in the scatter plots later on
df = df[['AGE', 'BEHAVIOR', 'COUNT', 'CRUDE_RATE', 'POPULATION', 'SEX', 'YEAR', 'SITE']]
print(df.head())

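The ‘COUNT’ column contains some missing values, marked with ‘~’, which we can remove:
df = df[df['COUNT'] != '~']
# drop=True discards the old index instead of adding it as a new column
df.reset_index(drop=True, inplace=True)
print(df.head())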
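
As an alternative to comparing against the ‘~’ marker directly, the same cleanup can be sketched with `pd.to_numeric`, which turns any non-numeric entry into a missing value that we can then drop. The toy data frame and the helper name `clean_count` below are made up for illustration and are not part of the CDC data or the original code:

```python
import pandas as pd

# Toy frame standing in for the real file; the values are invented.
toy = pd.DataFrame({
    "SEX": ["Female", "Male", "Male", "Female"],
    "COUNT": ["12", "~", "7", "30"],
})

def clean_count(frame, column="COUNT"):
    """Return a copy with non-numeric entries dropped and the column cast to int."""
    out = frame.copy()
    # errors="coerce" converts anything that is not a number (such as '~') to NaN
    out[column] = pd.to_numeric(out[column], errors="coerce")
    out = out.dropna(subset=[column]).reset_index(drop=True)
    out[column] = out[column].astype(int)
    return out

cleaned = clean_count(toy)
print(cleaned)
```

This version has the side benefit of leaving ‘COUNT’ as an integer column, so no separate cast is needed before plotting.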

To start our analysis, we can generate a histogram of the ‘COUNT’ column to visualize the distribution of tumor counts across all categories:
import seaborn as sns
import matplotlib.pyplot as plt
# the counts are read in as text, so convert them to integers first
df['COUNT'] = df['COUNT'].astype(int)
# settings for the histogram plot
sns.set(font_scale=2)
plt.ylim(0, 80)
plt.xlim(0, 10000)
df['COUNT'].hist(bins=1000)

We can also overlay the histograms of tumor counts for females and males on a single plot:
# define female and male data frames
df_female = df[df['SEX'] == 'Female']
df_male = df[df['SEX'] == 'Male']
# overlay histograms; transparency and labels keep the two groups distinguishable
ax = df_female['COUNT'].hist(bins=1000, alpha=0.5, label='Female')
df_male['COUNT'].hist(bins=1000, alpha=0.5, label='Male', ax=ax)
ax.set_title('Male and Female')
ax.legend()
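
Overlaid histograms can be hard to read when the two distributions are similar, so a numeric summary per group can complement the plot. The following is a minimal sketch using `groupby` and `describe`; the toy values are invented and only stand in for the cleaned data frame:

```python
import pandas as pd

# Toy data with made-up counts, standing in for the cleaned data frame.
toy = pd.DataFrame({
    "SEX": ["Female", "Female", "Male", "Male", "Male"],
    "COUNT": [10, 30, 5, 15, 40],
})

# One row of summary statistics (count, mean, std, quartiles, min, max) per sex.
summary = toy.groupby("SEX")["COUNT"].describe()
print(summary)
```

On the real data, replacing `toy` with `df` should give the same kind of table for the full ‘COUNT’ column.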

We can perform a more granular analysis and look at the distribution of tumor counts for a given reporting period, let's say ‘YEAR’ = ‘2012-2016’, for females and males:
# female counts for the 2012-2016 period; .copy() avoids modifying a view of df
df_female = df[(df['SEX'] == 'Female') & (df['YEAR'] == '2012-2016')].copy()
df_female['COUNT'] = df_female['COUNT'].astype(int)
df_female['COUNT'].hist(bins=1000).set_title('Female')

# male counts for the same period
df_male = df[(df['SEX'] == 'Male') & (df['YEAR'] == '2012-2016')].copy()
df_male['COUNT'] = df_male['COUNT'].astype(int)
df_male['COUNT'].hist(bins=1000).set_title('Male')

Next we can generate a scatter plot of ‘CRUDE_RATE’ vs. ‘YEAR’. Here we will also adjust the scatter plot points so that their sizes and hues are proportional to ‘COUNT’. We also filter the data frame to keep the rows corresponding to ‘SITE’ = ‘Anaplastic astrocytoma’, ‘BEHAVIOR’ = ‘Malignant’ and ‘CRUDE_RATE’ above 0.1. For females we have:
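df_female = df[df['SEX'] == 'Female']
df_female = df_female[df_female['SITE'] == 'Anaplastic astrocytoma']
df_female = df_female[df_female['BEHAVIOR'] == 'Malignant'].copy()
# make sure the rate is numeric before comparing (it may be read in as text)
df_female['CRUDE_RATE'] = pd.to_numeric(df_female['CRUDE_RATE'], errors='coerce')
# remove baseline value present in each year
df_female = df_female[df_female['CRUDE_RATE'] > 0.1]
# recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(data=df_female, x='YEAR', y='CRUDE_RATE', size='COUNT', hue='COUNT', sizes=(1000, 1500), alpha=0.8)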

We can also look at this plot for males:
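df_male = df[df['SEX'] == 'Male']
df_male = df_male[df_male['SITE'] == 'Anaplastic astrocytoma']
df_male = df_male[df_male['BEHAVIOR'] == 'Malignant'].copy()
# make sure the rate is numeric before comparing (it may be read in as text)
df_male['CRUDE_RATE'] = pd.to_numeric(df_male['CRUDE_RATE'], errors='coerce')
# remove baseline value present in each year
df_male = df_male[df_male['CRUDE_RATE'] > 0.1]
# recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(data=df_male, x='YEAR', y='CRUDE_RATE', size='COUNT', hue='COUNT', sizes=(1000, 1500), alpha=0.8)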

As you can see, the crude rates of Anaplastic astrocytoma have increased between 2004 and 2016 for both females and males. It is also worth noting that the rates for males are higher than those for females for this type of brain tumor. Try generating this scatter plot for other tumor types and see if there are any interesting differences between females and males.
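
To make it easier to repeat the filtering for other tumor types, the steps can be wrapped in a small helper. This is only a sketch: the function name `select_site`, its default arguments, and the toy values are assumptions made for illustration and do not come from the original code or the CDC data:

```python
import pandas as pd

def select_site(frame, sex, site, behavior="Malignant", min_rate=0.1):
    """Return rows for one sex, site, and behavior with CRUDE_RATE above min_rate."""
    out = frame.copy()
    # Coerce the rate to numbers; entries such as '~' become NaN and fail the filter.
    out["CRUDE_RATE"] = pd.to_numeric(out["CRUDE_RATE"], errors="coerce")
    mask = (
        (out["SEX"] == sex)
        & (out["SITE"] == site)
        & (out["BEHAVIOR"] == behavior)
        & (out["CRUDE_RATE"] > min_rate)
    )
    return out[mask].reset_index(drop=True)

# Toy frame with invented values, standing in for the real data.
toy = pd.DataFrame({
    "SEX": ["Female", "Male", "Female", "Female"],
    "SITE": ["Anaplastic astrocytoma"] * 3 + ["Glioblastoma"],
    "BEHAVIOR": ["Malignant"] * 4,
    "CRUDE_RATE": ["0.3", "0.5", "~", "3.1"],
    "COUNT": [120, 180, 90, 900],
    "YEAR": ["2004", "2004", "2005", "2004"],
})

females = select_site(toy, "Female", "Anaplastic astrocytoma")
print(females)
```

The returned frame can then be passed to `sns.scatterplot` through the `data` argument, changing only the `sex` and `site` values between plots.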
The next thing we can do is generate some statistics from some of these data columns. We can define a function that outputs the unique set of values for our categorical variables and counts the number of times that value appears in the data:
from collections import Counter
def get_unqiue_values(feature):
    print("{} Unique Set: ".format(feature), set(df[feature]))
    print("{} Count: ".format(feature), dict(Counter(df[feature])))
get_unqiue_values('SEX')
When we call the ‘get_unqiue_values’ function with the ‘SEX’ field, the output shows the set of unique values in the ‘SEX’ column and the number of times each value appears in the data.

We can do the same for "BEHAVIOR":
get_unqiue_values('BEHAVIOR')
The output shows the set of unique values in the ‘BEHAVIOR’ column and the number of times each value appears in the data.

And for the ‘SITE’ category:
get_unqiue_values('SITE')
The output shows the set of unique values in the ‘SITE’ column and the number of times each value appears in the data.

There are comparatively more unique ‘SITE’ values than there are for the other categorical variables in this data set. We can narrow the scope by looking at the five most common (most frequently occurring) values:
from collections import Counter
def get_unqiue_values(feature):
    print("{} Unique Set: ".format(feature), set(df[feature]))
    print("{} Count: ".format(feature), dict(Counter(df[feature]).most_common(5)))
get_unqiue_values('SITE')
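
Pandas offers a built-in route to the same frequency counts through `Series.value_counts`, which sorts values from most to least frequent. The sketch below compares it with the `Counter` approach on a toy column; the site labels ‘A’, ‘B’, and ‘C’ are placeholders rather than real ‘SITE’ values:

```python
import pandas as pd
from collections import Counter

# Toy column with placeholder labels instead of real tumor sites.
toy = pd.DataFrame({"SITE": ["A", "B", "A", "C", "A", "B"]})

# Counter-based approach, as used in this post.
top_counter = dict(Counter(toy["SITE"]).most_common(2))

# Equivalent pandas approach: counts sorted in descending order, then truncated.
top_pandas = toy["SITE"].value_counts().head(2).to_dict()

print(top_counter)
print(top_pandas)
```

Either approach works here; `value_counts` simply avoids the extra import and keeps the result as a pandas object until it is converted.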

Visualizing some of this data can be more useful than staring at dictionaries. To do this, we can update our function ‘get_unqiue_values’ so that it returns a dictionary of the 10 most common tumor sites:
def get_unqiue_values(feature):
    print("{} Unique Set: ".format(feature), set(df[feature]))
    print("{} Count: ".format(feature), dict(Counter(df[feature]).most_common(5)))
    # return the 10 most frequent values and their counts for plotting
    result = dict(Counter(df[feature]).most_common(10))
    return result
Next we iterate over the ‘SITE’ frequency dictionary and store the keys and values in a data frame:
key_list = []
value_list = []
for key, value in get_unqiue_values('SITE').items():
    key_list.append(key)
    value_list.append(value)
site_df = pd.DataFrame({'SITE': key_list, 'Count': value_list})
Finally, we define a barplot object using seaborn and set the x-axis labels:
ax = sns.barplot(x=site_df.SITE, y=site_df.Count)
ax.set_xticklabels(site_df.SITE, rotation=30)
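
The two-column data frame used for the bar plot can also be built in a single chained expression, without the intermediate lists. This is a sketch on toy data with placeholder site labels; the column names ‘SITE’ and ‘Count’ match the ones used above:

```python
import pandas as pd

# Toy column with placeholder labels instead of real tumor sites.
toy = pd.DataFrame({"SITE": ["A", "B", "A", "C", "A", "B"]})

# Count each site, keep the 10 most frequent, and turn the result into
# a data frame with 'SITE' and 'Count' columns.
site_df = (
    toy["SITE"]
    .value_counts()
    .head(10)
    .rename_axis("SITE")
    .reset_index(name="Count")
)
print(site_df)
```

The resulting `site_df` has the same layout as the one built from the dictionary, so it can be passed to `sns.barplot` unchanged.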

There is much more to explore in the ‘BRAINBYSITE.TXT’ data, but for now we will conclude our analysis. In the next few posts I will continue to explore some of the other data sets provided in the Centers for Disease Control and Prevention cancer dataset. Until then, feel free to repeat this analysis on the other data sets yourself. The code shown in this post can be found on GitHub. Good luck and happy machine learning!