
In this quick article, you will learn how to use the UCI data sets that come as .data files.
Where can data be found?
Kaggle.com is a great choice for finding data to use in your data science projects. The site is filled with interesting data sets, notebooks from other scientists and tutorials. All the data sets I have encountered on Kaggle have been .csv files, which is very convenient when working with pandas.
You might wonder (at least I did) if Kaggle is the only place where data can be found.
Hint: It is not!
You will also find awesome data sets on the UCI Machine Learning Repository. An example of an interesting data set is the Breast Cancer Wisconsin (Original) Data Set.
I recently wanted to use this exact data set to practice my classification skills. However, I quickly ran into some trouble (or so I thought). The data I had downloaded was contained in a .data file…

How do you work with that? I certainly didn’t know.
As I have only ever worked with .csv files (I am a relatively new data scientist), all I knew how to do was use the Pandas read_csv() function to import my data sets into a DataFrame.
To download the data, first click on the Data Folder, which will take you to a second page (lower half of the following picture). There, click on the file you want to download.

The .data file can be opened with Microsoft Excel or Notepad. I tried doing the latter:

You can see that all the data points are separated by commas!
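If you would rather check this from code than from Notepad, here is a minimal sketch. The helper names peek_lines and comma_counts are my own, not part of any library, and the sketch assumes the file is plain UTF-8 text.

```python
from itertools import islice


def peek_lines(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def comma_counts(lines):
    """Count the commas on each line; equal counts hint at a regular CSV layout."""
    return [line.count(",") for line in lines]
```

After uploading the file to Colab, calling peek_lines('breast-cancer-wisconsin.data') should list the first few records, and comma_counts on those lines should give the same number for every line if the file really is comma-separated.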
Naturally, I tried to load the data in Google Colab. I was very curious as to whether it would work or not.

As you can see, there is no problem with using read_csv() to read the data into a DataFrame.
I think this really shows how powerful Pandas is!
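One detail is worth watching when a file has no header row: by default, read_csv() treats the first line as column names, so the first record is silently used up as a header. The sketch below shows the difference on three sample rows in the same 11-column format. The rows are typed in here for illustration and are not read from the real file.

```python
import io

import pandas as pd

# Three illustrative rows in the same layout as the .data file (no header row)
SAMPLE = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
)

# Default behaviour: the first line becomes the header, so one record is lost
with_default = pd.read_csv(io.StringIO(SAMPLE))

# header=None keeps every line as data and numbers the columns 0, 1, 2, ...
with_header_none = pd.read_csv(io.StringIO(SAMPLE), header=None)

print(len(with_default))      # 2
print(len(with_header_none))  # 3
```

Passing header=None is therefore the safer choice for a file like this one.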
There is just one small thing missing: the column names. So let's add those.

Scroll down a bit on the page of a data set on UCI, and you will find the Attribute information. This provides the names for the features in the corresponding data set. Now we can add those to our DataFrame.
You add column names to your DataFrame with the .columns property on the DataFrame. Take a look:
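As a small, self-contained sketch of that assignment, here is a toy DataFrame with made-up values (not the UCI data). It also shows what happens if the list of names does not match the number of columns: pandas raises a ValueError.

```python
import pandas as pd

# Toy DataFrame with three unnamed columns (labelled 0, 1, 2 by default)
toy = pd.DataFrame([[1, 5, 2], [2, 3, 4]])

# One name per column, in order
toy.columns = ["Id", "Clump_thickness", "Class"]

# Too few (or too many) names is rejected with a ValueError
mismatch_raised = False
try:
    toy.columns = ["Id", "Class"]
except ValueError as err:
    mismatch_raised = True
    print("Could not rename:", err)
```

So the list you assign needs exactly one name per column, which for the breast cancer data set means 11 names.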

Here is all the code from Google Colab if you want to try it yourself (you will have to download the data from UCI and upload it to the Colab document):
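import pandas as pd

# The file has no header row, so header=None keeps the first record as data
dataset = pd.read_csv('breast-cancer-wisconsin.data', header=None)
# Column names taken from the Attribute Information on the UCI page
dataset.columns = ['Id', 'Clump_thickness', 'Uniformity_cell_size', 'Uniformity_cell_shape', 'Marginal_adhesion', 'Single_e_cell_size', 'Bare_nuclei', 'Bland_chromatin', 'Normal_nucleoli', 'Mitoses', 'Class']
dataset.head()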
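A possible variation, offered as a sketch rather than the one right way: read_csv() can take the column names directly through its names parameter, and its na_values parameter can turn a placeholder into a proper missing value. The UCI description of this data set says missing values are written as "?", so I assume that here. The function name load_uci_data and the two sample rows are my own, for illustration only.

```python
import io

import pandas as pd

COLUMN_NAMES = [
    'Id', 'Clump_thickness', 'Uniformity_cell_size', 'Uniformity_cell_shape',
    'Marginal_adhesion', 'Single_e_cell_size', 'Bare_nuclei',
    'Bland_chromatin', 'Normal_nucleoli', 'Mitoses', 'Class',
]


def load_uci_data(source):
    """Read a header-less, comma-separated file and name its columns.

    Assumes "?" marks a missing value, as the UCI description states.
    """
    return pd.read_csv(source, header=None, names=COLUMN_NAMES, na_values="?")


# Two illustrative rows; the second has a "?" in the Bare_nuclei position
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
)
df = load_uci_data(sample)
print(df["Bare_nuclei"].isna().sum())  # 1
```

With the real file you would call load_uci_data('breast-cancer-wisconsin.data') instead of passing the sample.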
Did you know? The .data file type is actually a text file. It is used by a data mining program called Analysis Studio; however, the program is no longer being developed (source: Fileinfo, visited 15–08–2020).
I hope this short article was useful to you. I am happy that I now know that I can use .data files from UCI without a problem!
_Keep learning!
- Jacob Toftgaard Rasmussen_