This is the third post in a series teaching you how to automate analysis of scientific laboratory data. The first article introduced the concept, provided some motivation, and described the general process. The second described several methods for identifying the test conditions and creating individual data files for each test. This third post will teach you how to write a program that automatically opens each file and performs calculations on each data set.
Automatically Opening Each Data File
The first step in automatically analyzing each data file is opening it. However, since we want to analyze numerous files, we can’t simply write a line of code that opens one specific file; instead we create a structure that sequentially opens every file in the project. This can be accomplished with glob, pandas, a for loop, and the following steps:
- First, store all files from the project in a specific folder. Having all of the files in the same folder is helpful because then you only need to point the program to a single path. If you used the dynamic filename recommendations from Part 2 you should be able to identify the differences between the tests without storing the files in many different folders.
- Second, use the glob package to create a list of all files in that folder that match the type of your data files. The glob.glob function serves this purpose well and only requires a single search pattern: the folder path combined with a wildcard for the file type. For instance, if you want to open all .csv files in a folder you might use the following three lines of code (the Path variable is made up and will need to be modified to match your particular case). The first line imports the glob package, the second line tells the code where the files are located, and the third line creates a variable called ‘Filenames’ that is populated with the paths of all .csv files in the folder specified by Path.
import glob
Path = r'C:\Users\JSmith\Documents\AutomatedDataAnalysisExample'
Filenames = glob.glob(Path + '/*.csv')
- Third, create a for loop that iterates through each of those files. Code opening and analyzing those files can be stored within the for loop. By doing so, you create a series of analysis steps that the program will sequentially perform on each file in Filenames. But first we need to create the for loop. It can be done using the following code, which creates a new variable ‘Filename’ that will temporarily hold each path stored in ‘Filenames’:
for Filename in Filenames:
- Finally, we need to open the file. Assuming that the data is stored in .csv files, you can open it with the pandas.read_csv function. If your code has already imported pandas under the alias pd (import pandas as pd), you can open each file by adding the following indented line within the for loop:
    Data = pd.read_csv(Filename)
And that’s it! Your code now sequentially opens each file contained within the folder. The next step is to add code within the for loop telling the program what to do with the data.
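Putting the steps above together, a minimal sketch might look like the following. The function name load_all_csv and the choice to collect the DataFrames in a dictionary are assumptions made for illustration; in your own program you may prefer to do the analysis directly inside the loop instead of storing every file.

```python
import glob
import os

import pandas as pd


def load_all_csv(folder):
    """Open every .csv file in `folder` and return {file name: DataFrame}.

    Hypothetical helper for illustration only.
    """
    # Build the search pattern, e.g. <folder>/*.csv
    pattern = os.path.join(folder, '*.csv')
    data_sets = {}
    # glob does not guarantee any particular order, so sort the paths
    # to make the processing order repeatable.
    for filename in sorted(glob.glob(pattern)):
        data_sets[os.path.basename(filename)] = pd.read_csv(filename)
    return data_sets
```

Calling load_all_csv(Path) with the Path variable from above would return one DataFrame per .csv file in that folder, keyed by file name.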
Automatically Analyzing Each Data Set
With the file opened, the program can move on to perform the actual analysis of the data. This process must be customized for each project, because the calculations depend on what that project needs to learn from its data. For that reason, this section provides a few tips for Python novices rather than specific steps to follow. A few important tips include:
- Opening the files with pandas.read_csv created a pandas DataFrame object containing the data from that file. A DataFrame is essentially a table. Opening the file this way gives access to all of the data manipulation capabilities of pandas. If you aren’t familiar with those, consider perusing the pandas documentation for tips.
- Calculations in Python can be performed with any number of useful packages. For this portion of the project, pandas and numpy are likely to provide the most important functionality. The goal is to find packages and write code that performs all of the calculations you need to do on each file.
- If you need to target a specific portion of each file, don’t forget to filter the data using a technique similar to those described in Part 2.
- Use matplotlib, bokeh, or another package to plot the results as needed.
- When you’re done with your calculations, you can save the results with the pandas DataFrame.to_csv function. Make sure to give it a new filename to avoid overwriting the raw data file.
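As a rough illustration of how these tips might fit together, the sketch below filters one file, calculates a couple of summary values, and saves them under a new name. Everything specific in it is an assumption: the function name analyze_file, the column names ‘Flow Rate’ and ‘Temperature’, the filter condition, and the ‘_Results’ suffix are all made up, and your own calculations will look different.

```python
import os

import pandas as pd


def analyze_file(filename, results_folder):
    """Filter one data file, summarize it, and save the results separately.

    Hypothetical example; column names and calculations are placeholders.
    """
    data = pd.read_csv(filename)

    # Hypothetical filter: keep only rows recorded while water was flowing.
    active = data[data['Flow Rate'] > 0]

    # Hypothetical calculations: summary statistics for one column.
    results = pd.DataFrame({
        'Mean Temperature': [active['Temperature'].mean()],
        'Max Temperature': [active['Temperature'].max()],
    })

    # Save under a new name in a separate folder so the raw file is untouched.
    os.makedirs(results_folder, exist_ok=True)
    base = os.path.splitext(os.path.basename(filename))[0]
    output_path = os.path.join(results_folder, base + '_Results.csv')
    results.to_csv(output_path, index=False)
    return output_path
```

Writing the results to a separate folder is a deliberate choice here: if they were saved next to the raw data, the next run of glob.glob would pick up the results files as if they were new tests.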
Next Steps
The second and third posts combined to give you a thorough guide to automatically analyzing each of the individual data files and saving the results. With the files opened, the calculations performed, and the data analyzed, it’s time for the most important part of the process: checking the data sets for errors. My next post will teach you a few ways to add code to your program that makes it easy for you to check the data sets for errors.