Scientists offer encounter extraordinarily large data sets. This happens for a very good reason; more data gives a more thorough understanding of the phenomenon they’re studying. But it also creates a problem; as data sets gets larger they become harder and harder to understand and use.
One excellent solution is learning how to automatically split these data sets into separate files. In that way each data file becomes a manageable about of information explaining one aspect of the phenomenon. It’s easy to work with. And, by doing it automatically, you can make your life much easier with very little effort.
This article will teach you how to do that. It will walk you through the process, and it will provide specific Python code to do it (Python 2.7).
It does so in the context of a real life problem, often encountered in science and engineering. To understand the concepts being studied, see The Automated Performance Map Creation Tutorial.
You can learn the concepts by reading this article. If you want to take things a step further, and leave the tutorial with confidence in your skills and useful tools, you can download the companion data set. This companion data set will allow you to test your code and check your results, ensuring that you’re learning this process correctly.
What will be covered in this article?
In this portion of the tutorial we will cover the process of splitting a data set containing results from multiple laboratory tests into individual files. Those files will each contain the results of a single test, and will have descriptive file names stating what data is contained within them. We will be using the techniques described in Automating Scientific Analysis Part 2.
Without further ado, let’s get started. The first step is import the Python packages that will enable the data analysis process.
How do I import packages in Python?
Each Python script needs to start with statements importing the required packages and capabilities. In this data file splitting script we will need:
- Pandas: This package is the premier option for data analysis in Python. It allows you to read your data into DataFrames, essentially tables, and provides a vast array of tools for manipulating that data. A wealth of information on Pandas can be found in the creator’s book Python for Data Analysis.
- os: os is a Python package that enables you to use commands from the computer’s operating system, impacting the computer outside of the data analysis process. In this case we will use it to create new folders.
- Bokeh: Bokeh is an interactive plotting tool in Python. It enables you to write code that automatically generates plots while analyzing your data, and gives options for the user to interact with them. A great source for further understanding Bokeh is Hands-On Data Visualization with Bokeh.
In this case we will pull in the entirety of both Pandas and os, but only specific functions from Bokeh. To do that, add the following code to the start of your program.
import pandas as pd
import os
from bokeh.plotting import figure, save, gridplot, output_file, ColumnDataSource
from bokeh.models import HoverTool
You can see those four lines importing the stated packages. Note that the line to import pandas also specifies that pandas has been imported as pd, meaning that we can refer to pandas as "pd" in the rest of the code. Also note that "ColumnDataSource" is on the code line that starts with "from bokeh.plotting".
Now that our packages have been imported, the next step is reading the necessary data so the script can work with it.
How do I read the data files?
Pandas has a fantastic tool for importing data sets. It’s the read_csv function. In order to read a file, you call the pandas.read_csv function, specify the location of the file, and set it to a variable. There are several other modifiers you can use to customize the import if desired, but we won’t be using them here.
This command needs to be used to import two files. The first is the data set itself. If you downloaded the companion data set, it is titled ‘COP_HPWH_f_Tamb&Tavg.csv’. If we imagine that you saved the file in the folder ‘C:UsersYourNameDocumentsAutomatedDataAnalysis’ then we can import the data with the following code:
Data = pd.read_csv(r'C:UsersYourNameDocumentsAutomatedDataAnalysisCOP_HPWH_f_Tamb&Tavg.csv')
Executing that code will cause the data set to be saved to the variable ‘Data’. Then all of Pandas data analysis capabilities can be used on the data set by referencing Data.
The second file that will be necessary is a table describing the tests contained in the file. For the sake of learning, it’s helpful if you create the table yourself. The data set contains the results from three tests, with different ambient temperatures (Ambient temperature refers to the temperature of air around the tested device). To create this data set, generate a table with the following information and save it as ‘Test_Plan.csv’ in the same folder as your data set.

Note that you will later reference the names of the columns, so it’s important to make sure that you use the same text as in the example data.
Now that the table is created and saved, you can read it into Python by writing:
Test_Plan = pd.read_csv(r'C:UsersYourNameDocumentsAutomatedDataAnalysisTest_Plan.csv')
Now that the data is read into the Python script, the next step is to identify the rows where each test ends and the next team beings.
How do I identify where each test ends?
In order to identify where each test ends and the next test begins you need some knowledge about the tests themselves. You need to think about how the tests are being performed, and identify a condition that would indicate that one test is ending.
In this case we’re analyzing data from tests studying heat pump water heaters (HPWH), which use electricity to heat water. Since we’re looking at how much electricity it consumes to heat the water, we can know that it’s consuming electricity during each test. This means that each test ends when the device stops using electricity.
We need to identify rows where the device stops using electricity. We can do this by subtracting the electricity consumption in each row from the electricity consumption in the previous row. If the result is negative, that means the HPWH is consuming less electricity than previously and the test has ended.
We can do this by using the .shift() function on our data set. This function does what it sounds like; it shifts the data by a specified number of rows. We can use .shift() to create a new row in the data frame that contains electricity consumption, P_Elec (W), data that has been shifted by one row. We can do this with the following line of code:
Data['P_Elec_Shift (W)'] = Data['P_Elec (W)'].shift(periods = -1)
This leads to two different columns in the data frame describing the HPWHs electricity consumption. P_Elec (W) states the electricity consumption, in Watts, in each row. P_Elec_Shift (W) states the electricity consumption, in Watts, of the next row. If we subtract P_Elec (W) from P_Elec_Shift (W), rows with negative values will indicate that the HPWH has stopped heating. We can do that with the following code:
Data['P_Elec_Shift (W)'] = Data['P_Elec_Shift (W)'] - Data['P_Elec (W)']
At this point we have a data frame with a row that contains 0 in every row except rows where each test ended. We can use that information to create a list that tells us when each test ended. We’ll call that list ‘End_Of_Tests’, to clearly signify that the information contained within it. We will then use the .index.tolist() function to populate that list. This can be done with the following code:
End_Of_Tests = []
End_Of_Tests = End_Of_Tests + Data[Data['P_Elec_Shift (W)'] < 0].index.tolist()
The first line creates the empty list ‘End_Of_Tests.’ While it originally holds no data, it’s ready to accept data from other commands. The second line adds data to End_Of_Tests. It says to look through Data to identify rows where ‘P_Elec_Shift (W)’ is negative, identify the index of those rows, and add them to the End_Of_Tests list.
Now that we’ve identified the rows that correspond to the end of each test we can split the data set into separate data sets, one for each test.
How do I split the data file?
The data file can be split into more manageable files using the following steps:
- First, you need to repeat the process once for each test. This means we need to iterate through it once for each entry in End_Of_Tests.
- Second, you need to create a new data frame that is a subset of the original data frame containing only data from a single test.
- Third, you need to use the conditions of the test to identify the test in the test plan that this data represents.
- Fourth, you need to save the data to a new file with a file name that states what data is contained in the file.
Iterating through End_Of_Tests
The first step can be accomplished with a simple for loop that iterates through the End_Of_Tests list. This can be done with the following code:
for i in range(len(End_Of_Tests)):
This creates a loop that will run x times, where x is the number of rows in End_Of_Tests/the number of tests contained in the file. Note that i will be an increasing integer (0, 1, 2, and so on) and can be used an an index. Also note that we now have an active for loop, so all future code will need to be indented until we leave the for loop.
Creating new data frames with subsets of the data
The second step can be accomplished by using the values of End_Of_Tests to identify the rows of Data corresponding to each test. In the first test, this means we need to select the data between the first row and the first value in End_Of_Tests. In the second test, this means we need to select the data between the first value in End_Of_Tests and the second value in End_Of_Tests. And so on for the third, and for more if we had more than three tests in this data set.
This difference in handling between the first test (Which starts at hard-coded row 0) and the future tests (Which start at an entry in End_Of_Tests) we need an if statement that changes the code based whether or not we’re pulling out the first test.
The code then needs to use the End_Of_Test values to identify the section of Data that we want, and save it to a new data frame.
This can be accomplished with the following code:
if i == 0:
File_SingleTest = Data[0:End_Of_Tests[i]]
else:
File_SingleTest = Data[End_Of_Tests[i-1]:End_Of_Tests[i]]
The first line checks to see if this code is being executed for the first time. If it is, that means it’s the first time through the loop and we’re extracting the first test. If it is, the code pulls the first row of Data (Index 0) through the first entry in End_Of_Tests (Denoted with End_Of_Tests[i], which is currently End_Of_Tests[0]) and stores it in a new data frame called File_SingleTest. If it’s not the first time through the code,that means we need to extract data from a test that is not the first. In that case we extract data from when the previous test ended (End_Of_Tests[i-1]) to when the current test ends (End_Of_Tests[i]) and save it to File_SingleTest.
Note that the data is always saved to File_SingleTest. This means that the data frame containing data from a single test will always be overwritten in the next iteration. It’s important to save the data before that happens!
Identifying the conditions of each test
Now we have a data frame containing the data of a specific test. But which test is it? We need to read the data to understand what happens in that test, and compare it to the specifications in the test plan. In that way we can identify which test is in the data frame and give the data frame a descriptive name.
Looking at the test plan, we see that the ambient temperature changes in each test. Test Number 1 has an ambient temperature of 55 degrees Fahrenheit, Test Number 2 has 70 degrees Fahrenheit, and Test 3 has 95 degrees Fahrenheit. This means that ambient temperature is our descriptor here.
We can identify the ambient temperature during a test with the following code:
Temperature_Ambient = File_SingleTest['T_Amb (deg F)'][-50:].mean()
This line reads the last 50 entries ([-50:]) of the column representing ambient temperature (‘T_Amb (deg F)’) in the File_SingleTest data frame, and calculates the average (.mean()). It then stores that value in Temperature_Ambient. In this way we have the ambient temperature for the last few minutes of the test stored in a variable.
The second step is to compare this value to the test plan and identify which test was performed. This is important because no test data will ever be perfect, and the average ambient temperature will not perfectly match the specification in the test plan. For example, a test with a 55 degree ambient temperature specified may actually have a 53.95 degree ambient temperature. Identifying the corresponding test makes file management easier.
The corresponding test can be identified by 1) Calculating the difference between the average temperature in the test and the ambient temperature called for in each test, and 2) Identifying the test with the minimum difference. This can be done with the following two lines of code:
Test_Plan['Error_Temperature_Ambient'] = abs(Test_Plan['Ambient Temperature (deg F)'] - Temperature_Ambient)
Temperature_Ambient = Test_Plan.loc[Test_Plan['Error_Temperature_Ambient'].idxmin(), 'Ambient Temperature (deg F)']
The first line adds a new column to the Test_Plan data frame that states the absolute value of the difference between the ambient temperature called for in that test and the average ambient temperature in the active test. The second line uses the .loc() and .idxmin() functions to identify the ambient temperature specified in the test with the least error and set that ambient temperature to our Temperature_Ambient variable.
Now we’re ready to save the data in a new file.
Saving the data to a new file
With a data frame containing the results from a single test, and knowledge of the conditions of that test, we can now save the results to a new file. This section will show you one of the techniques discussed in Automatically Storing Results from Automated Data Sets.
The first step is to ensure that the folder we want to save the data in exists. We could do that manually, but this is an article about automating things! Let’s program the script to do it for us.
Let’s say we want the data to be stored in the file ‘C:UsersYourNameDocumentsAutomatingDataAnalysisFiles_IndividualTests’. To make sure that folder exists we can use the following code:
Folder = 'C:UsersYourNameDocumentsAutomatingDataAnalysisFiles_IndividualTests'
if not os.path.exists(Folder):
os.makedirs(Folder)
The first line sets the path of our desired folder to the variable Folder. The second line uses the os function .path.exists() to check and see if that folder exists. If it doesn’t exist, it executes the third line of code to create the folder. In this way we can ensure that it exists.
Once the folder exists, we can use the same approach to save the data file into that folder. We specify the file name we want to use, using variables to contain data about the ambient temperature, and use the Pandas .to_csv() function to save the file where we desire. This can be done with the following code:
Filename_SingleTest = "PerformanceMap_HPWH_" + str(int(Temperature_Ambient)) + ".csv"
File_SingleTest.to_csv(Folder + Filename_SingleTest, index = False)
The first line creates the filename we want to use. It descriptively states that it’s a file containing data from testing used to create a performance map of a HPWH. The second part is more important. It takes the ambient temperature that we identified in the test plan as a variable, converts it to an integer, converts it to a string, and adds it to the filename. Now the filename contains the conditions of the test, telling you exactly what the file contains before you even open it.
The second line does the grunt work. It combines the folder specified previously with the filename for the current data frame, and saves it. Note that it also removes the index, because saving that isn’t important and helps keep files cleaner.
What’s the final step?
Now you’re at the fun part of writing this script. You get to run it. And you get to watch the program generate the results that you need for you.
Note that this process is a bit overkill in this tutorial. The companion data set has results from three fabricated tests. It wouldn’t be very hard, time consuming, or tedious to do this in a project with three tests. But what if you had 1,000 tests? Then this process becomes extremely valuable.
How can I check my results?
There are two steps to checking your results in this process.
The first is to ensure that you got the right files as outputs. To do that, you compare the files in your new folder to the tests called for in the test plan. There should be one file for each test, with the conditions in the file name matching the conditions called for in the test plan.
In this process, you should see the following files in your folder:

Notice how there are three files in that folder, and three tests specified in the test plan. Also notice how the test plan calls for tests at 55, 70, and 95 degrees, and those three temperatures are specified in the three file names. So far, it looks like everything worked correctly.
The second step is to examine the data contained in each file. The easiest way to do that is by plotting the data and visually examining it (Though, we will be discussing ways to do this automatically later).
A quick check is to create a plot showing the ambient temperature from each data file. This can be done with Bokeh.
First, we need to create a new column in the data frame that gives us the test time in user-friendly terms. In minutes since the test started. We can do this by adding the following line to our program (If we assume that the time between measurements is 5 seconds):
File_SingleTest['Time_SinceStart (min)'] = File_SingleTest.index * 10./60.
That gives us a user-friendly time to use as the x-axis in our plot. Then we can use bokeh to create and save the plot. We do that with the following code:
p1 = figure(width=800, height=400, x_axis_label='Time (min)', y_axis_label = 'Temperature (deg F)')
p1.circle(File_SingleTest['Time_SinceStart (min), File_SingleTest['T_Amb (deg F)'], legend='Ambient Temperature', color = 'red', source = source_temps)
The first line creates a figure called p1 and specifies both the size and axis labels for the plot. The second line adds red circles to the plot with the x-values specified by ‘Time_Since Start (min)’ and the y-values specified by ‘T_Amb (deg F)’. It also states that the legend reading should be ‘Ambient Temperature’.
The plot can be saved using the gridplot and outputfile features.
p=gridplot([[p1]])
output_file(Folder + 'PerformanceMap_HPWH_T_Amb=' + str(int(Temperature_Ambient)) + '.html', title = 'Temperatures')
save(p)
Bokeh has a handy feature called gridplot that enables you to store multiple plots in a single file. This makes it very convenient to look at related plots next to each other, comparing the data in them. This feature is not necessary for this tutorial, so we only entered our current plot (p1) in the grid. But you should know about it in case you want it in the future.
The second line specifies where the file is to be saved. It goes in the same folder where we saved the .csv files of the data, and uses the same file name convention as before. The difference is that the data was saved in .csv files, and this is saved in a .html file.
The third line finally saves the data.
If you re-run the code, you’ll see that same .csv files in your results folder. Additionally, you’ll now find new .html files. These .html files contain the plots of the data sets.
What would you expect to see if you open the plots?
First, you’d expect the ambient temperature recorded during the tests to be similar to the values specified in the test plan and file name. The tests at 55 degrees should have ambient temperatures around 55 degrees, and so on.
Second, with this being real world data you shouldn’t expect it to be perfect. Some values will be 54.5, others will be 55.2, and so on. Don’t freak out about that.
Third, you should expect to see the temperature adjusting at the start of the test. The original values will be at the temperature from the previous test, then the lab will need to gradually change the temperature to the new setting.
If we open the plot from the 55 deg test, we should see exactly that. Here’s what your result should look like:

Notice how the temperature starts at 75 degrees, then rapidly decreases to 55 degrees. As expected. Notice how the average temperature throughout the test is clearly 55 degrees. As expected. Also notice how the data bounces around 55 degrees instead of being exactly on, also as expected.
This file implies that the test was performed correctly, and the data file was split correctly. You’re ready to move on to the next step!
What’s next?
This article was step 1 in a tutorial teaching you how to automate your scientific data analysis. Now that the overwhelmingly large data file is split into three separate files, one for each test, we can begin to make use of those data files. The next step is to check the process the data files, performing our analysis. When the analysis is done we can then check the results to ensure that the test and calculations went correctly.
Tutorial Table of Contents
This is a part of a series of articles teaching you all of the skills needed to automatically analyze laboratory data and develop a performance map of heat pump water heaters. The other articles in the series can be found using the following links:
Automatically Analyzing Laboratory Test Data
Checking Analyzed Laboratory Data for Errors