Have you ever come across a use case in Spark where you cannot use the power of parallel processing of spark’s executors? In this article, we are going to explore one of the cool ways to use multithreading in Python for storing images in SQL database. This use-case cannot be achieved in Spark as there’s no way you can visualize your data unless you convert the data frame in pandas format. Once you convert into pandas, you lose all the advantages of spark. Therefore, to achieve a similar execution speed (may not be as fast as spark) we can leverage the multi-threading technique in Python.
Upon reading this article, you will learn:
- How to multi-thread in Python?
- How to efficiently store images in Sql database?
Use-Case
We are going to look at an hourly energy consumption dataset where we need to save the plots of consumption for each individual day for a duration of 3 months. Let’s look at the dataset which I downloaded from Kaggle:
Preparations
1. Install Libraries
You all must be knowing most of the libraries stated above. I’ll go through the less frequently used ones:
- Concurrent.futures – Used for launching parallel threads in python
- Repeat — Used to supply a stream of constant values
- Pyodbc — Used to install ODBC driver for Python to connect to SQL server. The following image shows how to connect to the SQL server which requires all the credentials:
2. Make a folder in your directory to store images
In order to store images in a SQL database, you first need to save each plot image in a .png format.
3. Generating and Storing Images
Now that you have installed the libraries and connected to the SQL server, you can begin the process of storing images. There are two different approaches through which we can achieve this:
- Our beloved, ‘For loop’:
- The above code uses a for loop to loop through all the unique days and create a plot for each day and saves it in the folder that we created earlier
- Afterward, we open each created image and store it in the SQL database in the already created table in the SQL database
- This process will take a lot of time if you have thousands of images to process upon. Hence, won’t be a scalable approach to move forward with
- With Multi-threading:
The above code uses concurrent.futures library to implement multi-threading. In the map function, you pass the ‘_plotconsumption ‘ function which will generate images for each date in the _list_dates
_ that __ is also passed as one of the arguments. Additionally, you can see how I am passing the dataframe in the repeat function that helps in supplying a constant stream of dataframe to all the concurrent threads processing each day.
It’s a wrap! I hope you learned something new today and can implement it in your projects and/or at workplace.
Thank you for reading!
- If you like my writing, then please subscribe to my list
- If you liked it, follow me on Medium
- Stay in touch on LinkedIn
References
- PJM – Search Results – Data Source
- Hourly Energy Consumption | Kaggle
- License to use the dataset