It is always nice to process the data using modern tools like Pandas or Jupyter. But let’s imagine the case when a colleague or friend asks to make a data analysis, but he or she is not a technical person, does not use Python or Jupyter, and does not have any account in Tableau, Power BI, or any other fancy (but, alas, not free) service. In this case, processing the data in Google Sheets can be a nice workaround because of several reasons:
- Google is used worldwide; at the time of writing this article, more than 1.8 billion users have a Google Account. Practically almost everyone has a Google account nowadays, and document sharing will be extremely easy.
- Google’s ecosystem is safe and secure. It supports two-factor authentication and modern security standards, and even private datasets can be shared between limited groups of people.
- Last but not least, the solution is free and does not require any extra costs. And as a bonus, Google Sheets works in the browser, does not require installing any software, and can work on any platform like Windows, Linux, OSX, or even on a smartphone.
In this article, I will make a basic exploratory data analysis in Pandas, then we will repeat this process in Google Sheets and see how it works.
Data Source
To make things more fun, let’s use a real dataset. We will make a tool to calculate the energy generated by solar panels. To do this, I will use the PVGIS (European Commission Photo Voltaic Geographical Information System) data, which can be accessed for free via this URL (CC BY 4.0 Licence):
Using this page, we can download solar irradiation data, allowing us to calculate energy generation. As can be seen in the screenshot, we can select hourly data for different years and different locations. After downloading the data, let’s use it in Pandas.
EDA in Pandas
Let’s start with exploratory data analysis (EDA) in Pandas. It’s always easier to use a tool, we’re familiar with; it will also allow us to verify the results. First, let’s load the dataset:
import pandas as pd
import datetime
df_eu = pd.read_csv("EUTimeseries_53.087_5.859_SA2_60deg_120deg_2020_2020.csv",
skiprows=8).dropna()
display(df_eu)
The code is self-explanatory. The beginning of the CSV file has comments and empty lines, so I used "skiprows=8" to skip the unwanted data; this was the only "tuning" required to read the file.
The output looks like this:
We have 8784 rows, representing the hourly collected data. According to the documentation, "G(i)" is the solar irradiance in Watts/m²; other parameters, like wind speed or temperature, are not needed for our task. The timestamp is not standard, let’s convert string values to date and time objects. I also need to convert "G(i)" values from string to float:
def str_to_date(d: str):
""" Convert string to datetime object """
try:
return datetime.datetime.strptime(d, '%Y%m%d:%H%M')
except:
return None
def str_to_float(f: str):
""" Convert string value to float """
try:
return float(f)
except:
return None
df_eu['time'] = df_eu['time'].map(str_to_date)
df_eu['G(i)'] = df_eu['G(i)'].map(str_to_float)
Now we are ready to do the calculations we want. The dataset contains the solar radiation in watts per square meter. The data was collected in 1-hour intervals, and we only need to divide values by 1000 to convert Watts to Kilowatts per hour (kWh). To get the final output in kWh, we also need to know the number of solar panels and the size and efficiency of each panel (this data can be found in the solar panel datasheet):
panels_amount = 1
panel_size_m2 = 2.5
panel_efficiency = 0.18
df_eu["kWh"] = panels_amount * panel_size_m2 * panel_efficiency * df_eu['G(i)'] / 1000
Now we can do some data exploration. Let’s find the solar generation per day, for example, in summer, on the 1st of June. I will be using the Bokeh Python library to draw the results:
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_notebook()
df_day = df_eu[df_eu['time'].dt.date == datetime.date(2020, 6, 1)]
source = ColumnDataSource(df_day)
p = figure(width=1600, height=600, x_axis_type='datetime',
title="Solar Panels Generation Per Day, kWh")
p.vbar(x='time', top='kWh', width=datetime.timedelta(minutes=50), source=source)
p.xgrid.grid_line_color = None
p.xaxis.ticker.desired_num_ticks = 12
p.y_range.start = 0
p.y_range.end = 0.4
show(p)
The output looks like this:
It is also easy to get the total generated amount:
print("Total, kWh:", df_day["kWh"].sum())
> Total, kWh: 1.5560864999999997
Our solar panel generated 1.56 kWh per day in June. As a comparison, the generation from the same panel in December was much lower:
Let’s see the generation per year and calculate the total output. To do this, I will group the dataframe by month:
df_eu["month"] = df_eu["time"].dt.month
df_eu_month = df_eu[["month", "kWh"]].groupby(["month"], as_index=False).sum()
display(df_eu_month.style.hide(axis="index"))
The output looks like this:
And as a final step, let’s see the chart:
source = ColumnDataSource(data=dict(months=df_eu_month["month"],
values=df_eu_month["kWh"]))
p = figure(width=1600, height=600,
title="Solar Panels Generation Per Year, kWh")
p.vbar(x='months', top='values', width=0.95, source=source)
p.xgrid.grid_line_color = None
p.xaxis.ticker.desired_num_ticks = 12
show(p)
The output:
As we can see, there is a drastic difference between the winter and summer months.
Finally, let’s get the total electricity generation during the year:
print("Total, kWh:", df_eu["kWh"].sum())
> Total, kWh: 335.38783499999994
Our 2.5 m² solar panel in total generated 335 kWh of electricity.
At this point, let’s finish the Python coding and see how we can do the same in Google Sheets.
Google Sheets
In general, our Data Processing flow will be the same as it was in Pandas. We need to load the dataset, convert column values, filter, and group values to see the result. Actually, when we know what we want to do with data, we can think in a "cross-platform way", and this makes processing much easier.
First, let’s load the dataset. I created a document in Google Sheets with two tabs, "Source" and "Calculation", and imported the CSV into the "Source" tab:
Now we need to group values by month. Let’s extract month values from the timestamp column. In Pandas, we did it like this:
df_eu["time"] = df_eu['time'].map(str_to_date)
df_eu["month"] = df_eu["time"].dt.month
In Google Sheets, I added this formula in the G10 cell:
=ArrayFormula(MID(A10:A8793, 5, 2))
Here, G10 is the first cell, where results will be located, and A10:A8793 is our timestamp data. A "MID" function is extracting a month from a string (a "20200101:0011" is a non-standard timestamp, and using a substring is easier), and an "ArrayFormula" method is automatically applying this function to the whole table. After entering the formula, Google Sheets will automatically create a new column for us.
In the same way, let’s create a "kWh" column from "G(i)". In Pandas, we did it like this:
df_eu["kWh"] = df_eu['G(i)'] / 1000
In Google Sheets, it works almost the same way. I added this formula to the H10 cell:
=ArrayFormula(B10:B8793/1000)
And as a last preprocessing step, let’s enter "Month" and "kWh" names as headers. A result should look like this:
We have a "month" column; now we can group data by month. What in Pandas is named "groupby", in Google Sheets can be done using a "Pivot table". Let’s create a new pivot table and enter "Source!A9:H8793" as a data source. Here "Source" is the name of the first tab, and A9:H8793 is our data. The table editor will automatically detect the column names, and we can select "Month" as a "Row" and "kWh" as a "Value":
The result is displayed in the screenshot. Actually, it is pretty close to what we did in Pandas:
df_eu_month = df_eu[["month", "kWh"]].groupby(["month"], as_index=False).sum()
Our dataset is ready; let’s do the required calculations. In the "Calculation" tab, I will use the first three cells as "variables" for the number of solar panels (cell B1), each panel size (cell B2), and panel efficiency (cell B3). Then I can calculate the total generation by adding a formula:
=ArrayFormula(B1*B2*B3*Source!K11:K22)
"Source!" is a link to our first tab with the data source, and K11–K22 are cells where the data grouped by month is located. Our new result will be placed in cells B7–B18, and we can also calculate the total generation:
=SUM(B7:B18)
It is easy to add extra labels and a chart; it does not require any formulas, and I will skip this part here. Our final generation data, grouped by month, should look like this:
Obviously, the results must be the same as what we get in Pandas; otherwise, something is wrong. The Google Sheets UI is interactive; we can change the number of solar panels or the panel efficiency, and Google Sheets will automatically recalculate all results.
Our last data exploration step is to get electricity generation for a specific date. To do this, I will put the required date in the "A24" cell and use two cells for results. The first one will contain time, and the second one will contain energy values:
=FILTER(MID(Source!A10:A8793, 10, 4), SEARCH(A24, Source!A10:A8793))
=FILTER(B1*B2*B3*Source!H10:H8793, SEARCH(A24, Source!A10:A8793))
Here, SEARCH(A24, Source!A10:A8793) is the filter applied to the source table; the first formula is getting the time of day, and the second one is calculating the energy in kWh. Optionally, we can add labels and a summary. The final page may look like this:
Conclusion
In this article, we imported a solar generation dataset in Google Sheets and were able to calculate and visualise different parameters, like solar panel generation per month or per specific day. This sheet is interactive and can be used by anyone without technical or programming skills. Last but not least, the solution has zero cost, and the document can be safely shared with anyone who has a Google account.
Obviously, there are many jokes in the community about "Data Science in Excel", and I’m not encouraging anyone to use Google Sheets as a main production tool. But it can be a nice addition for simple scenarios where we need to share results with others or make a simple data processing UI, and as we can see, basic operations like making charts, grouping data, or extracting substrings work well.
Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles will be published, as well as full access to thousands of stories from other authors.