
Connect to your database, create a table for your data, query it for cleaning, write the cleaned data out as a custom file, and then repeat the whole cycle for every remaining file. In this article, we will build a Python script that automates this entire cycle.
The process of cleaning data before loading it into your models is an integral part of your project’s life cycle. If you’re working at a company, you are probably sharing a database with your QA team, which complicates things further. In an environment where the accuracy of the data is vital to the system’s success, manual data-handling tasks are always a disadvantage, as they inevitably lead to errors.
To avoid the errors, low speed, and cost of training a human, you need to automate your data cleaning/modification tasks. In this article, we’ll build some non-trivial automation for your dataset, whether it lives on your local machine or in a cloud database like [Postgres](https://www.postgresql.org/), [MySQL](https://www.mysql.com/), or [AWS](https://aws.amazon.com/).
Tip: If you’re new to this, I’d recommend playing around on AWS rather than implementing this on localhost. Use pgAdmin alongside it to access your database right from your IDE.
The Cleantech Solar dataset contains several months of data from inverters and energy meters. This data needs further treatment before processing, which makes it a good example of the problem this automation solves. We’ll convert the raw data in Inverter-1, Inverter-2, MFM, and WMS into the required Gen-1 files.
You can download the dataset and execute the source code here: https://github.com/preetshah7/csv_automated_modifier
Here’s a step-by-step outline for our automation:
- Defining the Problem Statement
- Identifying CSVs using bash commands and loops
- Creating a pandas DataFrame to perform your cleaning tasks
- Configure your output directories using the OS library
- Creating required Gen-1 files from the DataFrame
- Importing your data to a Postgres database on Amazon Web Services (AWS)
Defining the Problem Statement
We have two months of raw data, stored separately, for 2 inverters, an energy meter (MFM), and a meteorological substation (WMS).
The tasks we need to perform are as follows:
- For inverters, column i32 holds the row’s timestamp. Make it the first column in the Gen-1 file and rename its header to ‘Timestamp’.
- MFM: same rules as above; the only difference is that the timestamp column is m63.
- WMS: same rules as above; the only difference is that the timestamp column is w23. The column names for each of the above are given in the corresponding sample output file.
Expected output format: [Station ID] → [Year] → [Year-Month] → [Substation-ID] → [Gen-1 Data.txt]
Identifying CSVs using bash commands and loops

We require the libraries numpy, pandas, and os for this implementation.
The month directories can be obtained with os.getcwd(); store them in a list.
Create a function that loops through all the files in a directory with os.listdir() and appends any CSVs it finds to our list. Given a path as input, this function automatically returns the list of CSV files inside it.
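A minimal sketch of such a helper (the function name and the directory layout are illustrative):

```python
import os

def get_csv_files(path):
    """Return the names of all CSV files found in the given directory."""
    return [f for f in os.listdir(path) if f.endswith('.csv')]

# Collect the month directories under the current working directory
cwd = os.getcwd()
months = [d for d in os.listdir(cwd) if os.path.isdir(os.path.join(cwd, d))]
```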
Creating a pandas DataFrame to perform your cleaning tasks
First, we perform the task on a single file and then automate it.
Take the sample column names from the respective DataFrame via df.columns. Then use df.loc[] to reposition the columns and assign the column names to our DataFrame.
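A rough sketch of this single-file pass (the input filename is hypothetical; the real column names come from the sample output file):

```python
import pandas as pd

df = pd.read_csv('Inverter-1/sample.csv')  # hypothetical input file

# Move the timestamp column ('i32' for inverters) to the front
cols = ['i32'] + [c for c in df.columns if c != 'i32']
df = df.loc[:, cols]

# Rename the header using the name required by the Gen-1 sample output
df = df.rename(columns={'i32': 'Timestamp'})
```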
Now that we have achieved our required output, we can automate this task for each meter.
Inverter:
As input, the function takes a list of files and the path where they are located. After the cleaning tasks, we rewrite the input path to point at the output directory (see the next step for creating the output directories), as in the sketch below.
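A minimal sketch of such an inverter function, assuming the output path is derived by rewriting the input path (the 'raw'/'Gen-1' substitution and the .txt extension are illustrative):

```python
def clean_inverter(files, path):
    """Clean every inverter CSV in `files` under `path` and write Gen-1 files."""
    for f in files:
        df = pd.read_csv(os.path.join(path, f))
        # 'i32' becomes the first column and is renamed to 'Timestamp'
        cols = ['i32'] + [c for c in df.columns if c != 'i32']
        df = df.loc[:, cols].rename(columns={'i32': 'Timestamp'})
        # Redirect the input path into the output tree (hypothetical rewrite)
        out_path = path.replace('raw', 'Gen-1')
        df.to_csv(os.path.join(out_path, f.replace('.csv', '.txt')),
                  sep='\t', index=False)
```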
Energy Meter (MFM):
The function for MFM is the same, except that the column we move is 'm63'. Remember to plug in the column names from the MFM sample output.
Meteorological Substation (WMS):
Here the timestamp column is 'w23'. Be careful to use the correct separator in [df.to_csv()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html); a tab separator ('\t') works in this case.
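The WMS version, condensed (the output filename is illustrative):

```python
# WMS: the timestamp column is 'w23'; note the tab separator in to_csv
cols = ['w23'] + [c for c in df.columns if c != 'w23']
df = df.loc[:, cols].rename(columns={'w23': 'Timestamp'})
df.to_csv('Gen-1 Data.txt', sep='\t', index=False)  # hypothetical output name
```

Since the three functions differ only in their timestamp column, they could also be collapsed into a single helper that takes the column name as a parameter.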
We have successfully created our automation tasks to be implemented later.
Configure your output directories using the OS library
See the [os module documentation](https://docs.python.org/3/library/os.html) for the miscellaneous operating-system interfaces used here.
Combine the meters with the months of data to obtain all the configurations, then use os.mkdir() to create all the directories automatically.
Exception handling in Python is easy to use compared to many other languages. We use it for our directory configuration so that if a directory is already present, the code still runs without an error, as shown in the sketch below.
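A minimal sketch under these assumptions (the month names and the 'Gen-1' output root are hypothetical; os.makedirs() is used here to build the nested tree in one call, whereas os.mkdir() works one level at a time):

```python
meters = ['Inverter-1', 'Inverter-2', 'MFM', 'WMS']
months = ['2020-01', '2020-02']  # hypothetical month directory names

for month in months:
    for meter in meters:
        out_dir = os.path.join('Gen-1', month, meter)  # hypothetical layout
        try:
            os.makedirs(out_dir)
        except FileExistsError:
            pass  # directory already exists; carry on without an error
```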

Creating required Gen-1 files from the DataFrame
Lastly, to run the automation end to end, we use a nested for loop. It applies the appropriate modification task to each file based on the string of its input path, as sketched below. Within seconds, the results appear in the directories the automation created. You can feed any volume of data into this pipeline, and it will consistently perform at high speed.
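Putting the earlier sketches together (clean_mfm and clean_wms are the analogous, hypothetical functions using 'm63' and 'w23'):

```python
for month in months:
    for meter in meters:
        path = os.path.join(month, meter)  # hypothetical raw-data layout
        files = get_csv_files(path)
        # Dispatch on the input path string
        if 'Inverter' in path:
            clean_inverter(files, path)
        elif 'MFM' in path:
            clean_mfm(files, path)  # same as clean_inverter, but with 'm63'
        elif 'WMS' in path:
            clean_wms(files, path)  # same as clean_inverter, but with 'w23'
```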
We have completed the automation from a local point of view; next, we’ll continue to its cloud application.

Importing your data to a Postgres database on Amazon Web Services (AWS)

For this article, we have used a Postgres database on AWS. We also need pgAdmin running on our machine to establish a connection with the database. Install the required library (psycopg2) in your environment; for other databases, the wrapper libraries may differ, but the cursor workflow remains the same.
This script will automatically import the CSV files into your Postgres database: just place them in the same directory as the notebook and run it; a sketch of the script follows. With that, we have successfully implemented data science task automation.
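A minimal sketch of such an import script (the connection parameters, table name, and column layout are hypothetical placeholders; psycopg2’s copy_expert() streams each file through Postgres’s COPY command):

```python
import os
import psycopg2

# Hypothetical connection details for a Postgres instance on AWS
conn = psycopg2.connect(
    host='your-instance.abc123.us-east-1.rds.amazonaws.com',
    port=5432,
    dbname='your_db',
    user='your_user',
    password='your_password',
)
cur = conn.cursor()

# Create a table matching the Gen-1 layout (columns are illustrative)
cur.execute("""
    CREATE TABLE IF NOT EXISTS gen1_data (
        "Timestamp" timestamp,
        value_1 numeric,
        value_2 numeric
    )
""")

# Stream every CSV in the notebook's directory straight into the table
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.csv'):
        with open(filename) as f:
            cur.copy_expert('COPY gen1_data FROM STDIN WITH CSV HEADER', f)

conn.commit()
cur.close()
conn.close()
```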
Summary
Here’s a summary of the step-by-step procedure we followed for building a CSV auto modifier:
- We defined the Problem Statement
- Identified CSVs using bash commands and loops
- Created a pandas DataFrame to perform our cleaning tasks
- Configured output directories using the OS library
- Created the required Gen-1 files from the DataFrame
- Imported data to a Postgres Database on Amazon Web Services (AWS)
Future Work
Here are some ways in which we can extend the project:
- Decrease the time complexity, so this could work quickly on huge volumes of data; use list comprehensions wherever possible.
- Implement the most common databases in practice so more people can take advantage of this work; common suggestions would be MongoDB and MySQL.
- Implement unsupervised machine learning in the automation itself, independent of the number of features. Data cleaning would be considerably more difficult here.
References
- Stack Overflow answers on `.ix` and `.iloc`
- Automation problem statement released by Cleantech Solar
- [os Module Python Examples (GeeksforGeeks)](https://www.geeksforgeeks.org/os-module-python-examples/)
- [psycopg2 documentation](https://www.psycopg.org/docs/)