
Automatically Storing Results from Analyzed Data Sets

How to Store Data Analysis Results to Facilitate Later Regression Analysis

This is the fifth article in a series teaching you how to write programs that automatically analyze scientific data. The first presented the concept and motivation, then laid out the high-level steps. The second taught you how to structure data sets to make automated data analysis possible, and how to automatically identify the conditions of each test. The third article discussed creating a for loop that automatically performs calculations on each test result and saves the results. The fourth post covered what is likely the most important part: automatically checking the data and analysis for errors. This fifth post will teach you how to store data in a logical folder structure, enabling easy access to the data for regression development and validation.

Storing Intermediate Results for Later Analysis

So far all of the discussion has focused on analyzing results from individual tests. The next step is to begin to think about the bigger picture and create ways to combine those individual test results into data sets describing the results from the entire project. The first step is storing the individual test results in a logical manner that facilitates later analysis.

There are two general tricks to storing intermediate results for later analysis in an automated process. The first is planning the organizational structure to ensure that all files can be easily located when needed. The second is using dynamic file names in the code so that the results are saved to new files with each iteration through the program.

Creating the Folder Hierarchy

Planning the organizational structure essentially means creating a folder hierarchy that makes sense for a given project. For example, say that a project includes performing several experiments on multiple pieces of equipment. The goal is to create regressions emulating the performance of each piece of equipment. In this case, there’s value in creating a folder for each piece of equipment, then storing results from individual tests within the corresponding folders. Figure 1 provides an example of how this folder hierarchy could be structured.

Figure 1: Example Folder Hierarchy

Storing Files Using Dynamic Names

The second point to keep in mind is that all references to stored data should use dynamic names: file and folder names built from variables, taken from the data set using the techniques described in Part 2, so that each name is specific to that data set. For example, a data set may contain data specific to Equipment 2 Test 3. In that case, any code saving data for that data set must use variables to specify that it needs to use the "Test 3" subfolder of the "Equipment 2" folder.

When creating the folder structure, it is necessary to ensure that all folders exist. There are two approaches to doing this. The first is to manually create folders for the project, laying everything out ahead of time. That may be a good approach if it helps you think through the process and create a strong structure, but this is a series on automation! It’s easier to let Python do the work. The structure can be created automatically by including the appropriate code in the analysis loop. It is done using the following steps.

1) Import the os package, enabling access to commands controlling the computer’s operating system. This can be done with the Python code "import os".

2) Within the analysis loop, use the techniques in Part 2 to determine which test is being performed. Using the hierarchy table in Figure 1 as an example, this might result in a variable "Equipment" set to "Equipment 2" and a variable "Test" set to "Test 3". Ensure that both values are stored in their variables as strings.

3) Specify the folder for the existing data set using variables and input from the data set. In our current example, this could be done with the following code:

Folder = r'C:/Users/JSmith/DataAnalysis/' + Equipment + '/' + Test

4) Determine whether or not the folder exists using the os.path.exists command, and create the folder if needed using the following code:

if not os.path.exists(Folder):
    os.makedirs(Folder)

Those steps create code that will automatically generate all folders needed for the structure. The same techniques can be used to create further levels of subfolders as needed for any given project.
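The four steps above can be combined into a short sketch like the one below. It is only an illustration under a few assumptions: the list of equipment and test names is hypothetical (in a real project those values would come from each data set, as described in Part 2), and a temporary directory stands in for the project folder so the example can run on any computer.

```python
import os
import tempfile

# Assumption: a temporary directory stands in for the project folder
# (the article's example uses a folder under C:/Users/JSmith/DataAnalysis).
base_folder = tempfile.mkdtemp()

# Hypothetical (Equipment, Test) pairs. In the real analysis loop these
# strings would be identified from each data set using the Part 2 techniques.
tests = [("Equipment 1", "Test 1"), ("Equipment 2", "Test 3")]

created = []
for equipment, test in tests:
    # Build the folder path dynamically from the variables
    folder = os.path.join(base_folder, equipment, test)

    # Create the folder, and any missing parent folders, only if needed
    if not os.path.exists(folder):
        os.makedirs(folder)

    created.append(folder)

print(created)
```

The sketch uses os.path.join rather than joining strings with '/', which lets Python choose the correct separator for the operating system. Either approach produces the same folder structure.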

Then, the results of each test need to be stored in the appropriate folders. The code for saving the results varies between packages. Results can be saved with pandas, bokeh, and matplotlib using the following code examples.


pandas data frames have the conveniently named .to_csv method. Readers should consult the pandas documentation for specific details of how this works, but the general approach is to call the method and specify the file path. For the current example and a data frame called 'Data', this can be done with the following code:

Data.to_csv(r'C:/Users/JSmith/DataAnalysis/' + Equipment + '/' + Test + '/' + Equipment + '_' + Test + '.csv')

The final portion, '/' + Equipment + '_' + Test + '.csv', was added to the previous code to provide a name to the .csv file placed in the folder. A shorter way to accomplish the same objective, assuming that the previous code was used to define the variable Folder, is:

Data.to_csv(Folder + '/' + Equipment + '_' + Test + '.csv')
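As a rough end-to-end illustration, the sketch below creates the folder and then saves a small data frame into it. The column names and values are made up for the example, a temporary directory stands in for the project folder, and index=False is an optional choice added here to keep the row index out of the file; none of these come from the original project.

```python
import os
import tempfile

import pandas as pd

# Assumption: a temporary directory stands in for the project folder
base_folder = tempfile.mkdtemp()

# In the analysis loop these would be identified from the data set (Part 2)
Equipment = "Equipment 2"
Test = "Test 3"

# Build the folder path and make sure the folder exists before saving
Folder = os.path.join(base_folder, Equipment, Test)
if not os.path.exists(Folder):
    os.makedirs(Folder)

# Hypothetical test results, used only to have something to save
Data = pd.DataFrame({
    "Time (s)": [0, 1, 2],
    "Temperature (C)": [20.0, 20.5, 21.1],
})

# Dynamic file name built from the same variables as the folder
file_path = os.path.join(Folder, Equipment + "_" + Test + ".csv")
Data.to_csv(file_path, index=False)

# Reading the file back is a quick check that the save worked
reloaded = pd.read_csv(file_path)
print(reloaded)
```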


bokeh uses a somewhat more complicated approach to saving files. This provides the ability to store multiple plots within a single file. It is performed using the following steps:

1) Create a gridplot. The gridplot function allows specification of how multiple plots should be arranged within a single file. An outer list specifies the grid as a whole, while each inner list specifies the plots within one row. For example, a gridplot with two plots on the first row and three on the second would be programmed like:

p = gridplot([[p1, p2], [p3, p4, p5]])

2) Specify the desired file location, and the desired title. Continuing the example of Equipment 2 – Test 3, this could be done with the following code:

output_file(Folder + '/' + Test + '.html', title = Test + '.html')

3) Save the plot. This is done with the intuitive save() command. The syntax to save the plot in this example is:

save(p)
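Putting the three bokeh steps together might look something like the following sketch. The plot titles and data are invented for illustration, only two plots are used to keep it short, and a temporary directory stands in for the Folder variable defined earlier.

```python
import os
import tempfile

from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, save

# Assumption: a temporary directory stands in for the test's folder
Folder = tempfile.mkdtemp()
Test = "Test 3"

# Hypothetical data, used only to have something to plot
x = [0, 1, 2, 3]
p1 = figure(title="Temperature")
p1.line(x, [20.0, 20.5, 21.1, 21.4])
p2 = figure(title="Flow Rate")
p2.line(x, [1.0, 1.1, 0.9, 1.0])

# Step 1: arrange the plots in a grid (here, one row with two plots)
p = gridplot([[p1, p2]])

# Step 2: specify the file location and the title of the html page
output_file(os.path.join(Folder, Test + ".html"), title=Test)

# Step 3: save the grid; save() returns the path of the file it wrote
saved_path = save(p)
print(saved_path)
```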



matplotlib uses a very simple file saving convention. The command is plt.savefig(). The syntax for this example is:

plt.savefig(Folder + '/' + Test)
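A slightly fuller sketch of the matplotlib case is shown below, with made-up data and a temporary directory standing in for Folder. Two details are choices made for this example rather than part of the original code: the "Agg" backend is selected so the script runs without opening a window, and the file name includes an explicit ".png" extension. When no extension is given, matplotlib falls back to its default format, which is png unless it has been configured otherwise.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Assumption: a temporary directory stands in for the test's folder
Folder = tempfile.mkdtemp()
Test = "Test 3"

# Hypothetical data, used only to have something to plot
plt.figure()
plt.plot([0, 1, 2, 3], [20.0, 20.5, 21.1, 21.4])
plt.xlabel("Time (s)")
plt.ylabel("Temperature (C)")

# Dynamic file name built from the Test variable
file_path = os.path.join(Folder, Test + ".png")
plt.savefig(file_path)

# Close the figure so plots from later tests start from a clean slate
plt.close()
```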

Next Steps

This article taught you how to modify your code to save data from individual tests in logical locations. This sets you up for the next phase of the process, which is to use the data to generate regressions. By storing data in logical places you’ve made it easy to open that data and use it as the data set for later regression analysis. The next article will cover that exact topic: How to create, validate, and document regressions of your data sets.
