
As a data scientist, one of the last things I ever worry about, or even think about, is whether the method of importing my data into my Jupyter Notebook or RStudio session is working properly. I place my full faith in those two underscore-joined words: "read" "csv". Whether it be in Python or in R, my job is to clean, dissect and analyse the data. The reading is something I rarely think about, other than which delimiter to separate on. Today, however, I ran into two issues whilst using a read function from R's Tidyverse. Here's what happened.
I was sent several TSVs from a client. I proceeded to open each with the above-mentioned R function, appended them to a list and then concatenated them all into one large data frame. I did some checks and inspections and everything seemed to make sense. Columns were clear. Rows were populated. I was ready to work.
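The workflow looked roughly like this (a minimal sketch; the folder name and file pattern are placeholders, not the client's actual files):

library(tidyverse)

# Hypothetical folder of client TSVs
files <- list.files("client_data", pattern = "\\.tsv$", full.names = TRUE)

# Read each file with a Tidyverse read function and collect the data frames
df_list <- lapply(files, read_tsv)

# Stack them into one large data frame
data <- bind_rows(df_list)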
I proceeded to work and once I had finished (or at least thought I'd finished), I ran my data through a report that gives some high-level information about the quality of the data and the analyses. It was only whilst reading the output of the report that I realised that there seemed to be some missing data. A lot of missing data!
I then checked the contents of each of the individual data frames saved to the list and found two anomalies.
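A quick comparison of the per-file row counts, something along these lines, is what exposed them:

# Row count of each data frame read into the list
sapply(df_list, nrow)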

As can be seen, two of the data frames had far fewer rows than the rest. I checked them by opening the original files manually and realised that, in my session, they were indeed missing a lot of data.
I then powered up a trusty Jupyter Notebook and used Pandas' read_csv with sep='\t'.
import pandas as pd

df_list = []
for file in files:  # `files` holds the paths of the client's TSVs
    # the separator must be the tab character '\t', not the letter 't'
    df = pd.read_csv(file, sep='\t')
    df_list.append(df)
data = pd.concat(df_list)
I concatenated all the data frames and checked the row count, and there were roughly a million more rows of data.
Two technologies, two basic functions that do the same thing. But one left me missing ~1 million rows of data.
But it didn’t end there. I had a lot of good code to work with in my RStudio notebook, so I exported the data from Python as one big file and reimported it into my R session, this time with read_csv (because I had exported it from Python as a CSV). Thankfully the missing million rows had returned.
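The re-import on the R side was a plain read_csv call, roughly like this (the file name is a placeholder, not the actual export):

library(tidyverse)

# Read the single CSV exported from the pandas session (placeholder file name)
data_r <- read_csv("export_from_python.csv")

# The row count now matched what pandas had reported
nrow(data_r)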
I then did a few sanity checks and realised that many of the columns were now full of nulls! So whilst I now had the right number of rows, a lot of the data in them was missing.
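The sanity check was essentially a count of missing values per column, something like:

# Number of NAs in each column; several columns came back almost entirely empty
colSums(is.na(data_r))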
I didn’t know what to blame: the import in R or the export from Python. This was easy to answer by re-importing the file with Python. If nothing was missing, the issue wasn’t with the export. And that was exactly the case: all the data was there when re-importing into Python.
So I returned to R and tried to import the data not with a Tidyverse function, but with the fread function from the data.table package. It is also just an ordinary data-import function, much like read_csv or read_tsv.
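A minimal sketch of the data.table version (same placeholder file name as above):

library(data.table)

# fread detects the delimiter and reads the whole file
data_dt <- fread("export_from_python.csv")

# Check the rows and the per-column NA counts again
nrow(data_dt)
colSums(is.na(data_dt))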
To my joy, I not only had all the rows I needed, but all the data within them was correctly populated. To my misery, it meant that the faith I had placed in any file-importing function until now had to go.
Summary
Both issues came whilst using a function from the Tidyverse. There was no difference in the method of opening the original files when using R or Python, and still R returned ~1 million fewer rows. There was also no difference in the way I re-imported the data when using Tidyverse or data.table in R, yet Tidyverse introduced nulls and data.table didn’t. I’m not sure whether the Tidyverse functions themselves are to blame, but the inconsistency is worrying and hard to understand.
I guess the reason to even write this is to warn of the possibility that your preferred method of importing data may sometimes prove less reliable than you need. Don’t place all of your faith in the function. Do your due diligence and make sure that the data frame you ended up with is the same as the data you told the function to import.