Estimating Counterfactual Energy Usage of Buildings with Machine Learning

Can we build ML models that predict a building’s energy usage? Absolutely!

Steven Smiley
Towards Data Science

--

Table of Contents (TOC)

  1. Abstract
  2. Background
  3. Materials & Methods
  4. Results & Conclusions
  5. References

1. Abstract

The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE)¹ is one of the largest energy efficiency research societies in the world. It was founded in 1894 and has over 54,000 members serving 132 countries. Since 1993, it has hosted three large data competitions aimed at predicting building energy use. The most recent one was hosted in October 2019 on Kaggle.² A grand total of $25,000 in prizes was split among the top 5 teams, with the 1st place team awarded $10,000. The competition ended December 19, 2019. I was late to the competition, but was very interested in it because of my background in Mechanical Engineering (ME) and my passion for Machine Learning (ML). Therefore, I decided to tackle the problem after the due date in order to better understand ML and how it could apply to my field, ME. Thus, this article investigates how much energy a building will consume based on the October 2019 Kaggle data. Why? Because as the competition states:²

Assessing the value of energy efficiency improvements can be challenging as there’s no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower cost financing.

Figure 0.0 Use of prediction models for energy savings interventions, from the IPMVP.³³ This picture illustrates the use of prediction models as a comparison baseline for calculating energy savings in long-term building performance models.²⁴

2. Background

In order to conduct this analysis, four Jupyter notebooks were constructed in Python using the data from ASHRAE and its contributors.³ ⁴ ⁵ ⁶ ⁷

  1. Part-1-Divide.ipynb
  2. Part-2-And.ipynb
  3. Part-3-Conquer.ipynb
  4. Part-4-AllSiteIds.ipynb

The Raw Input Data comes in 6 different files with 17 unique features among them.

As shown above, there are 16 unique site IDs hosting 1448 unique buildings. The buildings’ primary use (primary_use) ranges among 16 unique categories (e.g., Office, Education). In addition, there are 4 unique meter types measuring the energy use (1. chilled water, 2. electric, 3. hot water, 4. steam).

Furthermore, since the time range spans 2016 through 2018, there are over 61 million rows of timestamped meter readings! This is a lot of data. In fact, the training and testing files were ~0.7 and 1.4 GB respectively, requiring a less-than-trivial approach to data handling!
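Just to get oriented, here is a minimal loading sketch. The file names follow the standard Kaggle layout for this competition, and the DATA_DIR path is a hypothetical local folder, not something from my notebooks:

import pandas as pd

DATA_DIR = "ashrae-energy-prediction"  # hypothetical local folder holding the raw files
files = ["building_metadata.csv", "weather_train.csv", "weather_test.csv",
         "train.csv", "test.csv", "sample_submission.csv"]
for name in files:
    df = pd.read_csv(f"{DATA_DIR}/{name}")
    # Print the shape and column names to see the 17 unique features spread across the files
    print(name, df.shape, list(df.columns))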

As a quick disclaimer, using the data outside of the competition is allowed for educational purposes:²

Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education.

The accuracy of these models is evaluated based on the Root Mean Squared Logarithmic Error (RMSLE):

\epsilon = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}

Equation 0. Root Mean Squared Logarithmic Error (RMSLE) for model evaluation.

Where:

  • ϵ is the RMSLE value (score),
  • n is the total number of observations in the (public/private) data set,
  • p_i is your prediction of the target,
  • a_i is the actual target for observation i, and
  • log(x) is the natural logarithm of x.
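The metric is straightforward to compute yourself. Here is a minimal sketch in NumPy, just an illustration of Equation 0 rather than the competition’s actual scoring code:

import numpy as np

def rmsle(predicted, actual):
    # Root Mean Squared Logarithmic Error (Equation 0)
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))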

3. Materials & Methods

Figure 0.1 Overall Flow Chart for Jupyter Notebooks Used.

One thing that quickly became apparent was the existence of “Leaked” data sources for this problem.²⁹ I did not want to use the “Leaked” data because I wanted a fair assessment of how my models would work on unseen future data. In my opinion, that is a much more valuable approach, especially since I am not doing it for the competition. During the competition, using the leaked data was common and appeared to be the only way to win.

What makes my method unique is that I broke down the problem by Site_ID. Therefore, my approach breaks the data up into 16 models, since there are 16 unique Site_ID numbers. This significantly reduced the amount of data to train with at a time. It also allowed me to focus on the unique differences within each Site_ID: whatever trends I noticed within a Site_ID, I was better able to model. This wouldn’t be the case with the basic, fully aggregated approach.

Most people appeared to aggregate all of the data to form the prediction, yielding one ML model to predict future energy usage. A few tried splitting the data by meter type, yielding 4 unique models.²⁸ Some tried splitting the data in two based on the time of year for training.¹⁰ The winning team used a blending & stacking approach with the “Leaked” data, as well as with combinations of Site_ID, Meter, and Building & Meter.³¹

After a little digging on the web, I found a recent thesis by William Hedén²⁵ that describes a few more methods with some parallels to the winning team’s approach. His thesis was not on this problem specifically, but on the energy usage of 187 households.

“First, each households was modeled independently, yielding a total number of 187 models whose predictions were aggregated to form the total prediction. Secondly, the households were aggregated and a single model was developed, treating the customer base as a single unit. A third option consisted of grouping similar households based on the average daily load profile of each household.”²⁵

Our problem has 1448 unique buildings, which theoretically means we could make 1448 unique models like Hedén’s first approach. However, that approach was not optimal in his thesis. The optimal approach was to cluster the data based on the average daily load profile of each household (option 3).

Therefore, my approach was to use something similar to that optimal approach with each Site ID, because the sites appeared to have statistically different average meter readings according to a 1-way ANOVA (p-value ~0). This means we can reject the null hypothesis (that each site has the same average meter reading), assuming the outliers and messy data are not obscuring the true means.

import pingouin as pg

aov = pg.anova(data=data_train, dv='meter_reading', between=['site_id'], detailed=True)
Figure 0.2 Box Plot of Raw Training Data, excluding outliers by limiting y-axis to 2000 kWh. Notice the mean for Site 13 isn’t even within this range.
Figure 1.0 Flow Chart of Part-1-Divide.ipynb.

Part 1. Divide Data by Unique Site ID’s

Part-1-Divide.ipynb is a short notebook, but very effective.

Make Directories for Output Files. This was where I made the 16 unique output folders as shown in Figure 1. These folders are the placeholders to import and export files to when creating unique models for each Site ID.

Splits =['Site_ID_0',
'Site_ID_1',
'Site_ID_2',
'Site_ID_3',
'Site_ID_4',
'Site_ID_5',
'Site_ID_6',
'Site_ID_7',
'Site_ID_8',
'Site_ID_9',
'Site_ID_10',
'Site_ID_11',
'Site_ID_12',
'Site_ID_13',
'Site_ID_14',
'Site_ID_15']

Import and View Data. I just made sure that the data was there as described above before splitting it up by unique Site ID. I made sure there were no null values for the Site IDs, and I also made sure that I could capture the index or reference for every timestamp. This is essential for gluing everything back together at the end of the process.

building = file_loader(building)
weather_train_data = file_loader(weather_train_data)
data_train = file_loader(data_train)
weather_test_data = file_loader(weather_test_data)
data_test = file_loader(data_test)
Figure 1.1. Some figures from Part-1-Divide.ipynb Jupyter Notebook.

Reduce Data Memory. These files were huge! The test.csv file was over 1 GB alone. Thus I needed to find a way to reduce the memory footprint of those files without losing their useful information. On the Kaggle discussion boards I found some data minification strategies that were useful not only for this project, but for future ones.³² Essentially, you can reduce the data down to the bare minimum bits of information by inspecting each column’s data type and downcasting it accordingly, as sketched below. This is absolutely necessary to keep the script running and not crash your computer when you try to run all of the data at once for training ML models. I used this strategy for the rest of the workflow.
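For reference, the core idea behind that kind of downcasting can be sketched as follows (a simplified illustration, not the exact reduce_mem_usage helper from the referenced notebook³²):

import pandas as pd

def downcast_numeric(df):
    # Shrink each numeric column to the smallest dtype that still holds its values
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df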

building = reduce_mem_usage(building)
weather_train_data = reduce_mem_usage(weather_train_data)
weather_test_data = reduce_mem_usage(weather_test_data)
data_train = reduce_mem_usage(data_train)
data_test = reduce_mem_usage(data_test)

Merge Data. After I reduced the main files to their smallest useful form, I merged them together.

merge()

data_train = data_train.merge(building, on='building_id', how='left')
data_train = data_train.merge(weather_train_data,
                              on=['site_id', 'timestamp'], how='left')
data_test = data_test.merge(building, on='building_id', how='left')
data_test = data_test.merge(weather_test_data,
                            on=['site_id', 'timestamp'], how='left')

Divide & Export Data. The pickle function was a great tool for exporting these files after they were merged and separated.

count = 0
for Split_Number in list(Splits):
    dummy = data_train[(data_train['site_id'] == count)]
    # OUTPUTS: Folder for storing OUTPUTS
    print(Split_Number)
    dummy.to_pickle(os.path.join(OUTPUT_split_path[count],
                                 'site_id-{}-train.pkl'.format(count)))
    count += 1
Figure 2.0 Flow Chart of Part-2-And.ipynb.

Part 2. Exploratory Data Analysis & Cleaning

I started working on this problem blindly without looking at the discussion boards. This led me down a torturous path of finding and trying to understand what to do with the missing and zero value data. However, this didn’t go far because of how messy and large the raw data was.

So then I started to read up on what people were doing about the large, messy data and saw several approaches to it. People were quick to point out the outliers they had discovered, such as:

  1. Buildings showing meter readings for electricity before being built.³⁰
  2. Periods of constant readings for long periods of time.³¹
  3. Big positive and negative spikes.³¹
  4. Sites having buildings with meter reading anomalies at the same frequency.³¹
  5. Buildings in general missing data or having zero values (obvious one I noticed right off the bat).

For the missing data, I made this little function, missing_table , in order to capture and summarize them.

missing_table()

def missing_table(data_name):
    non_null_counts = data_name.count()
    null_counts = data_name.isnull().sum()
    total_counts = non_null_counts + null_counts
    percent_missing = round(100 * null_counts / total_counts, 1)
    min_non_null = data_name.min()
    median_non_null = data_name.quantile(q=0.5)
    max_non_null = data_name.max()
    missing_data = pd.concat([total_counts, non_null_counts,
                              null_counts, percent_missing,
                              min_non_null, median_non_null,
                              max_non_null], axis=1,
                             keys=['Total Counts', 'Non-Null Counts',
                                   'Null Counts', 'Percent Missing(%)',
                                   'Non-Null Minimum', 'Non-Null Median',
                                   'Non-Null Maximum'])
    return missing_data
Figure 2.1. An example of a missing data table summary from Site ID 15. There is a lot of data missing!

How did I handle all of this mess? Well... I first resampled the data to an hourly frequency, since that was the frequency most of the raw data came in. Doing this adds more null values, though, because it creates timestamps for hours that were never recorded. Therefore, I filled those voids, along with the large amount of data that was previously missing, with a series of dropping, filling null values, interpolating, forward filling, backward filling, and finally filling null values again.

resample(“H”).mean()

I resampled the data by taking the average value after grouping it by unique building_id, meter, site_id, primary_use, and square feet. This left some blank values for other items like air_temperature, dew_temperature, wind_speed, etc. That isn’t a problem, though; interpolation will handle most of those gaps.

data_train["timestamp"] = pd.to_datetime(data_train["timestamp"], format='%Y-%m-%d %H')
data_train = data_train.set_index('timestamp')

data_test["timestamp"] = pd.to_datetime(data_test["timestamp"], format='%Y-%m-%d %H')
data_test = data_test.set_index('timestamp')

grouplist = ['building_id', 'meter', 'site_id',
             'primary_use', 'square_feet']

data_train = data_train.groupby(grouplist).resample('H').mean()
# Drop the duplicated group-key columns if they remain (the keys also live in the index)
data_train = data_train.drop(columns=grouplist, errors='ignore')

data_test = data_test.groupby(grouplist).resample('H').mean()
data_test = data_test.drop(columns=grouplist, errors='ignore')

dropna()

If a column had what I thought was too much missing data, I dropped it for that Site_ID. Therefore, for columns with more than 40% missing data for a site, I did the following:

thresh = len(data_train)*.6
data_train.dropna(thresh = thresh, axis = 1, inplace = True)

thresh = len(data_test)*.6
data_test.dropna(thresh = thresh, axis = 1, inplace = True)

fillna()

I didn’t want to lose track of my index when resampling the data, so I kept the original index for reference and set the newly created rows to -1.

data_train['index'].fillna(-1, inplace = True)
data_test['index'].fillna(-1, inplace = True)

interpolate()

The following continuous parameters were interpolated for missing values.

data_train['meter_reading'] = data_train['meter_reading'].interpolate()
data_train['air_temperature'] = data_train['air_temperature'].interpolate()
data_train['dew_temperature'] = data_train['dew_temperature'].interpolate()
data_train['cloud_coverage'] = data_train['cloud_coverage'].interpolate()
data_train['precip_depth_1_hr'] = data_train['precip_depth_1_hr'].interpolate()
data_train['sea_level_pressure'] = data_train['sea_level_pressure'].interpolate()
data_train['wind_direction'] = data_train['wind_direction'].interpolate()

data_test['air_temperature'] = data_test['air_temperature'].interpolate()
data_test['dew_temperature'] = data_test['dew_temperature'].interpolate()
data_test['cloud_coverage'] = data_test['cloud_coverage'].interpolate()
data_test['precip_depth_1_hr'] = data_test['precip_depth_1_hr'].interpolate()
data_test['sea_level_pressure'] = data_test['sea_level_pressure'].interpolate()
data_test['wind_direction'] = data_test['wind_direction'].interpolate()

pad()

Usually not much of the data was missing at this point. This forward-filling function, pad(), usually got the last bit. It takes the last known value and carries it forward into the next missing one.

grouplist=['building_id','meter','site_id',
'primary_use','square_feet']
data_train=data_train.groupby(grouplist).pad()
data_test=data_test.groupby(grouplist).pad()

isnull()

At this point, I would use the isnull() function to check whether anything was still missing. If so, I used the previous fillna() with a median value, or dropped it completely (a small sketch of the median fill follows the check below). In addition, I took individual Site ID notes in the Jupyter Notebook wherever things needed specific effort.

data_train.isnull().sum()
data_test.isnull().sum()
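A minimal sketch of that final median fill, assuming data_train and data_test as prepared above (illustrative, not the exact notebook code):

for df in (data_train, data_test):
    for col in df.select_dtypes(include="number").columns:
        if df[col].isnull().any():
            # Fill any remaining stragglers with that column's median
            df[col] = df[col].fillna(df[col].median())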

Outliers… well, now that the data is at least complete, what about outliers?

There were two main things I did:

  1. Dropped Meter Readings that were Outlier High (spikes).

I assumed they would make it difficult to build an accurate ML model, as many others assumed as well.³¹

data_train = data_train.drop((data_train.loc[data_train['meter_reading'] > 1.5e+04])['meter_reading'].index)

  2. Dropped Meter Readings that were zero value.

I assumed these were meant to be missing data, or that they were filled in during the previous steps because nothing was really there. It may have been a bit redundant given the earlier steps, but at least it was inclusive.

data_train = data_train.drop((data_train.loc[data_train['meter_reading'] == 0])['meter_reading'].index)

Now we can see something!

It is hard to view data that is messy and missing lots of values. So now that the data is cleaned, I just wanted to verify it visually. Here are a few plots showing air_temperature distributions for a few sites. You can see that some of these sites have their own seasons and are probably in different geographic climates. Notice the left skew for Site 13, the roughly normal distribution for Site 12, and the bimodal distribution for Site 14.

Figure 2.2. A few air temperature figures from different sites, showing their unique distributions.
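A minimal plotting sketch for this kind of check, assuming data_train is the cleaned frame for a single Site ID (not the exact plotting code behind the figures):

import matplotlib.pyplot as plt

data_train["air_temperature"].hist(bins=50)
plt.xlabel("air_temperature")
plt.ylabel("count")
plt.title("air_temperature distribution for this Site ID")
plt.show()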
Figure 3.0 Flow Chart of Part-3-Conquer.ipynb.

Part 3. Feature Engineering, Feature Extraction, and Machine Learning.

feature_engineering()

def feature_engineering(data):
    data["hour"] = data["timestamp"].dt.hour
    data["week"] = data["timestamp"].dt.week
    data["month"] = data["timestamp"].dt.month
    data["weekday"] = data["timestamp"].dt.weekday
    data['Sensible_Heat'] = 0.5274*(10.**(-4.))*data['square_feet']*(75.-data['air_temperature'])

    data['log_square_feet'] = np.log(data['square_feet'])
    data['log_floor_count'] = np.log(data['floor_count'])

    data['square_dew_temperature'] = np.square(data['dew_temperature'])

    # Holidays
    holidays = ["2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04",
                "2016-09-05", "2016-10-10", "2016-11-11", "2016-11-24", "2016-12-26",
                "2017-01-01", "2017-01-16", "2017-02-20", "2017-05-29", "2017-07-04",
                "2017-09-04", "2017-10-09", "2017-11-10", "2017-11-23", "2017-12-25",
                "2018-01-01", "2018-01-15", "2018-02-19", "2018-05-28", "2018-07-04",
                "2018-09-03", "2018-10-08", "2018-11-12", "2018-11-22", "2018-12-25",
                "2019-01-01"]
    data["is_holiday"] = (data.timestamp.dt.date.astype("str").isin(holidays)).astype(int)

    return data

Most people appeared to Feature Engineer using time on their side. For example, the week of the year was a common feature that had a strong correlation with the target energy usage. This makes sense because a year has its seasons like winter or summer and the week number (i.e. 51 of 52, winter week) informs what season you are in. I used some of these time strategies, but went a little deeper on the Mechanical Engineering side of the problem. I saw all of these base features as inputs to equations I had been using and working with in my everyday engineering life. I figured I could engineer new features that might have a relationship with the target outcome.

Thus, I dug into the laws of thermodynamics and HVAC.

We want to predict how much energy usage occurs for a given meter based on building usage data and environmental weather data. It is important to understand the factors involved and how they relate.

The Big Picture view comes from the 1st law of thermodynamics:

ΔU = Q − W

Equation 1. First Law of Thermodynamics.

Where:

  • Q is heat added to the system
  • W is the work done by the system
  • ΔU is the change in internal energy.

Thus, these buildings change the amount of heat they add or remove based on environmental conditions (i.e. air temperature, dew temperature, wind_speed, wind_direction, and cloud_coverage), requiring work in the form of energy (electricity, steam, etc).

One of the big connections between each building and the first law of thermodynamics is the building’s Heating, Ventilation, and Air Conditioning (HVAC) units.

Factors that can affect how efficiently these HVAC units add or remove heat could be associated with the age of the building ( year_built ). Older buildings could be less efficient at adding or removing heat due to out of date HVAC units (unless they are renovated!). If the HVAC system was not properly maintained over the years, then it might not produce as much cold air as normal due to refrigerant or airflow problems.²⁷

The ventilation rate for heat removal on the basis of the sensible heat due to the occupants of the room can be expressed as:

Equation 2a. Room Ventilation Rate for Heat Removal.

Where:

  • q_dot is the sensible heat removal rate for the room.
  • rho is the average air density for the room.
  • cp is the air’s constant specific heat for the room.
  • Vroom is the volume of the room.
  • Tid is the indoor design temperature for the room (typically 75 degrees Fahrenheit).
  • Tin is the temperature of the air coming into the room.

Thus, bigger buildings probably need more energy or HVAC capacity due to more people and larger air volumes. The volume of air inside a building is proportional to the number of rooms and their respective volumes. Thus, the square footage of the building ( square_feet ) is proportional to the building air volume, which is proportional to the amount of sensible heat removal needed for ventilation. However, the minimum ventilation rates required depend on the type of building and the individual rooms’ unique requirements, as discussed in ASHRAE Standard 62.1. For a rough proportional relationship, we can feature engineer a new variable, qs, the sensible heat removal rate per building, as follows:

Equation 2b. Step Outline for Feature Engineering the Sensible Heat Removal.
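In rough terms, and assuming a roughly constant ceiling height so that a building’s air volume scales with its floor area, the relationship described above reduces to a proportionality (a sketch of the step outline, not the exact derivation from the notebook):

q_s \;\propto\; V_{room}\,(T_{id} - T_{in}) \;\propto\; \text{square\_feet} \times (75^{\circ}\mathrm{F} - T_{air})

This matches the Sensible_Heat feature in feature_engineering() above, where the constant 0.5274×10⁻⁴ presumably absorbs the air properties (ρ, cp), the assumed ceiling height, and unit conversions.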

There are standards, though! ASHRAE Standard 62.1 specifies the required breathing zone outdoor air (i.e., the outdoor ventilation air in the breathing zone), Vbz, as a function of both zone occupancy, Pz, and zone floor area, Az.²⁶ The first term (RpPz) accounts for contaminants produced by the occupants, while the second term (RaAz) accounts for contaminants produced by the building. ASHRAE Standard 62.1 requires that the following rate is maintained during operation under all load conditions:

Vbz = Rp·Pz + Ra·Az

Equation 3. ASHRAE Standard 62.1 Minimum Required Breathing Zone Outdoor Air Rate.²⁶

Where:

  • Vbz is the Minimum Required Breathing Zone Outdoor Air Rate
  • Rp is the People Outdoor Air Rate
  • Pz is the Zone Occupancy
  • Ra is the Area Outdoor Air Rate
  • Az is the Zone Floor Area.

Thus, each building ( building_id ) would probably have its own unique design requirements based on this universal ASHRAE 62.1 standard. The inputs that go into the standard are based on the design number of people occupying the room and the type of building or room being occupied. For example, an Educational zone might require 10 cfm/person, while an Office Building zone might require 5 cfm/person.²⁶ What this means is that we could expect there to be different clusters of energy usage based on the building types ( primary_use or site_id ). A small worked example of the standard’s formula is sketched below.
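To make Equation 3 concrete, here is a tiny worked example. The Rp values follow the article’s example rates; the occupancy, floor area, and Ra values are purely illustrative assumptions, not taken from the standard:

def breathing_zone_outdoor_air(Rp, Pz, Ra, Az):
    # Equation 3: Vbz = Rp*Pz + Ra*Az  (cfm)
    return Rp * Pz + Ra * Az

# Office zone:    Rp = 5 cfm/person (from the article); Pz, Ra, Az assumed for illustration
# Education zone: Rp = 10 cfm/person (from the article)
office_Vbz = breathing_zone_outdoor_air(Rp=5, Pz=20, Ra=0.06, Az=1000)      # 5*20 + 0.06*1000 = 160 cfm
education_Vbz = breathing_zone_outdoor_air(Rp=10, Pz=20, Ra=0.12, Az=1000)  # 10*20 + 0.12*1000 = 320 cfm
print(office_Vbz, education_Vbz)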

Figure 3.1. Feature Importance Plots for different Site IDs from the final run.

Splitting the Data

I split the data so that a portion of each season was captured for training and validation with the Machine Learning (ML) algorithms. Winter was captured by January (1), Spring by April (4), Summer by July (7), and Fall by October (10). Since we had so much data, I further split those months in half and gave half back to training. Depending on the Site_ID, this left between 75% and 90% of the data for training, which is in line with the 80% rule of thumb. This is a Time Series Regression problem, so it is important that the validation data is not all packed into one month, or sampled too densely, because of overfitting. I believe the way I split it was a fair method given only a full year of training data: the validation data is roughly equally spaced throughout the year used to predict two years into the future.

from sklearn.model_selection import train_test_split

# Hold the original test set aside for the final (future) predictions
final_test = test

# Pull out one month per season (Jan, Apr, Jul, Oct) as candidate validation data
test = train[(train['timestamp'].dt.month.isin([1,4,7,10]))]

# Give half of those seasonal months back to training
train_w, test = train_test_split(test, test_size=.50, random_state=42)

# Keep the remaining months for training, then recombine
train = train[(train['timestamp'].dt.month.isin([2,3,5,6,8,9,11,12]))]

train = pd.concat([train, train_w], axis=0)

Scaling the Data

I used the MinMaxScaler() to get values between 0 and 1 for the ML algorithms. I realize it might have been better to use StandardScaler() because of the possible outliers.

#Scale Data
scalerTrain = MinMaxScaler(feature_range=(0,1))

X_train=scalerTrain.fit_transform(X_train)
X_test=scalerTrain.transform(X_test)
X_final_test=scalerTrain.transform(X_final_test)

Machine Learning with LightGBM

I tried using the Random Forest regressor and the XGBoost regressor before deciding to go with the LightGBM regressor in the end. From my preliminary runs, Random Forest was a lot less accurate and took a lot longer to train, and XGBoost was also less accurate and slower than LightGBM. Training could take hours, depending on how things were set up.

Originally when I threw all of my data at the algorithms, it would take overnight, and usually my kernel would crash. I even went to Google Cloud Platform and started running models from the cloud. Even those kernels crashed… haha. This was the other reason I decided to go with 16 different models that were smaller. The data was manageable for me. I could see what I was doing, and I could fix things when needed. I could also get work done without having my computer just slow to a halt from all of the RAM allocation to local runs.

So I ended up fine-tuning some Hyperparameters, but probably not well enough, mainly because it took so long to get these runs done. Still, I saw great progress between my base run and my last one as of writing this article. Essentially, my base run had a Root Mean Squared Logarithmic Error (RMSLE) of 1.566 on the public leaderboard, and my final run scored 1.292 (~20% better). I know these scores are not the greatest, but that wasn’t the point of this whole thing anyway, since it is post-competition. It was a learning experience.

My best Hyperparameters for LightGBM were:

best_params = {
    "objective": "regression",
    "boosting": "gbdt",
    "num_leaves": 1500,
    "learning_rate": 0.05,
    "feature_fraction": 0.8,   # a single fraction, not a list
    "reg_lambda": 2,
    "num_boost_round": 300,
    "metric": {"rmse"},
}

I split the training data, which was already split as mentioned before, into a K-fold of 3 for training/validation as follows:

import lightgbm as lgb
from sklearn.model_selection import KFold

kf = KFold(n_splits=3, shuffle=False)
count = 0
models = []
for params in list(params_total):
    print('Params: ', count)
    for train_index, test_index in kf.split(features_for_train):
        train_features = features_for_train.loc[train_index]
        train_target = target.loc[train_index]

        test_features = features_for_train.loc[test_index]
        test_target = target.loc[test_index]

        d_training = lgb.Dataset(train_features,
                                 label=train_target,
                                 free_raw_data=False)
        d_test = lgb.Dataset(test_features,
                             label=test_target,
                             free_raw_data=False)

        model = lgb.train(params,
                          train_set=d_training,
                          valid_sets=[d_training, d_test],
                          verbose_eval=100,
                          early_stopping_rounds=50)

        models.append(model)
    count += 1

I then predicted on the test set (NOT the final test set!) to see how well the model would perform, as shown below. Notice that because there were 3 folds, there were 3 optimized models (each with its own best_iteration). Therefore, my final result was based on the average of those 3.

I used the same method for the final test, meaning that I didn’t change which models I used after training/validation. I just looked at the figures shown below to see how the models performed on data I could compare against, before applying them to data I couldn’t compare to (the unseen future test data, 2017 to 2019). They look to perform pretty well! The RMSLEs are less than 1, and the correlation coefficients are strong (>0.9) for almost all sites.

results = []
for model in models:
    # Average the (inverse log-transformed) predictions from the 3 fold models
    if len(results) == 0:
        results = np.expm1(model.predict(features_for_test, num_iteration=model.best_iteration)) / len(models)
    else:
        results += np.expm1(model.predict(features_for_test, num_iteration=model.best_iteration)) / len(models)
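As a rough sketch of the kind of check behind Figure 3.2, one could compare the averaged predictions against the held-out test targets. Here target_for_test is a hypothetical name for the actual meter readings of the held-out months, not a variable from the notebook:

import numpy as np

rmsle_score = np.sqrt(np.mean((np.log1p(results) - np.log1p(target_for_test)) ** 2))
corr = np.corrcoef(results, target_for_test)[0, 1]
print("RMSLE: {:.3f}, correlation: {:.3f}".format(rmsle_score, corr))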
Figure 3.2. Linear Regression comparison of Actual Target Test Data vs. Predicted Values with LightGBM.
Figure 4.0 Flow Chart of Part-4-AllSiteIds.ipynb.

4. Results & Conclusions

Part 4. Evaluate All Models Together.

So… now it is time to combine the results from all 16 site_id predictions.

count = 0
for Split_Number in list(Splits):
    dummy = os.path.join(OUTPUT_split_path[count],
                         "test_Combined_Site_ID_{}.csv".format(count))
    dummy2 = pd.DataFrame(file_loader(dummy))

    if count == 0:
        test_Combined = dummy2
    else:
        test_Combined = pd.concat([test_Combined, dummy2], axis=0)
    count += 1
Figure 4.1. Showing improvement from Base run.

As mentioned in the previous section, that was a good improvement in RMSLE, nearly 20% on both the public and private scores. Obviously, further tweaking of the ML Hyperparameters and Data Cleaning would probably help get a better score. What I think is more impressive is what is shown below!

So… what do these numbers really look like?!

Figure 4.2 Average Hourly Consumption (kWh) of all Meter Readings for all Buildings on each Site ID.

Wow, I think that looks impressive! The red lines are two years into a future the models never saw during training, and intuitively they look very reasonable. The blue lines show how well we predicted the data we could compare against, which also looks reasonable. I think we can safely say that we can build Machine Learning (ML) models to predict a building’s future energy usage. What a fun and exciting project! I learned so much on this journey, as well as how much I don’t know. Thank you for reading!

5. References

  1. ASHRAE. About ASHRAE. Retrieved January, 2020 from https://www.ashrae.org/about
  2. Kaggle. ASHRAE-Great Energy Predictor III (2019). Retrieved January, 2020 from https://www.kaggle.com/c/ashrae-energy-prediction/overview
  3. ASHRAE. Data Driven Modeling (DDM) Subcommittee of ASHRAE Technical Committee 4.7: Energy Calculations
  4. SinBerBest. Singapore Berkeley Building Efficiency and Sustainability in the Tropics
  5. BUDS Lab. Building and Urban Data Science
  6. Engineering Experiment Station Texas A&M University
  7. Mr. Chris Balbach, ASHRAE Contest Administration Team Member
  8. Dr. Jeff Haberl, ASHRAE Contest Administration Team Member
  9. Dr. Krishnan Gowri, ASHRAE Contest Administration Team Member
  10. Vopani, Kaggle Notebook, “ASHRAE: Half and Half.” Retrieved January, 2020 from https://www.kaggle.com/rohanrao/ashrae-half-and-half
  11. CeasarLupum, Kaggle Notebook, “ASHRAE — Start Here: A GENTLE Introduction.” Retrieved January, 2020 from https://www.kaggle.com/caesarlupum/ashrae-start-here-a-gentle-introduction
  12. CeasarLupum, Kaggle Notebook, “ASHRAE — LigthGBM simpleFE.” Retrieved January, 2020 from https://www.kaggle.com/caesarlupum/ashrae-ligthgbm-simple-fe
  13. Roman, Kaggle Notebook, “EDA for ASHRAE.” Retrieved January, 2020 from https://www.kaggle.com/nroman/eda-for-ashrae#meter
  14. Sandeep Kumar, “ASHRAE — KFold LightGBM — without leak (1.08).” Retrieved January, 2020 from https://www.kaggle.com/aitude/ashrae-kfold-lightgbm-without-leak-1-08
  15. SciPy. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2019) SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python. preprint arXiv:1907.10121
  16. Python. a) Travis E. Oliphant. Python for Scientific Computing, Computing in Science & Engineering, 9, 10–20 (2007) b) K. Jarrod Millman and Michael Aivazis. Python for Scientists and Engineers, Computing in Science & Engineering, 13, 9–12 (2011)
  17. NumPy. a) Travis E. Oliphant. A guide to NumPy, USA: Trelgol Publishing, (2006). b) Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, 13, 22–30 (2011)
  18. IPython. a) Fernando Pérez and Brian E. Granger. IPython: A System for Interactive Scientific Computing, Computing in Science & Engineering, 9, 21–29 (2007)
  19. Matplotlib. J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
  20. Pandas. Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51–56 (2010)
  21. Scikit-Learn. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825–2830 (2011)
  22. Scikit-Image. Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu and the scikit-image contributors. scikit-image: Image processing in Python, PeerJ 2:e453 (2014)
  23. Author: Plotly Technologies Inc. Title: Collaborative data science Publisher: Plotly Technologies Inc. Place of publication: Montréal, QC Date of publication: 2015 URL: https://plot.ly
  24. Miller, C. More Buildings Make More Generalizable Models — Benchmarking Prediction Methods on Open Electrical Meter Data. Mach. Learn. Knowl. Extr. 2019, 1, 974–993.
  25. W. Hedén, ‘Predicting Hourly Residential Energy Consumption using Random Forest and Support Vector Regression : An Analysis of the Impact of Household Clustering on the Performance Accuracy’, Dissertation, 2016.
  26. LINDEBURG, MICHAEL R. MECHANICAL ENGINEERING REFERENCE MANUAL. 13th ed., PROFESSIONAL PUBLICATIONS, 2013.
  27. ASHRAE. Technical Resources, “Top Ten Things Consumers Should Know About Air Conditioning.” Retrieved January, 2020 from https://www.ashrae.org/technical-resources/free-resources/top-ten-things-consumers-should-know-about-air-conditioning
  28. NZ, “Aligned Timestamp -LGBM by meter type.” Retrieved January, 2020 from https://www.kaggle.com/nz0722/aligned-timestamp-lgbm-by-meter-type
  29. Kaggle. ASHRAE-Great Energy Predictor III (2019) — Discussion Board. “What will be done about data leaks?” Retrieved January, 2020 https://www.kaggle.com/c/ashrae-energy-prediction/discussion/116739
  30. Kaggle. ASHRAE-Great Energy Predictor III (2019) — Discussion Board. “The building that consume energy before built.” Retrieved January, 2020. https://www.kaggle.com/c/ashrae-energy-prediction/discussion/113254
  31. Kaggle. ASHRAE-Great Energy Predictor III (2019) — Discussion Board. “1st Place Solution Team Isamu & Matt.” Retrieved January, 2020. https://www.kaggle.com/c/ashrae-energy-prediction/discussion/124709
  32. Konstantin Yakovlev. Kaggle Notebook,“ASHRAE — Data minification.” Retrieved January, 2020. https://www.kaggle.com/kyakovlev/ashrae-data-minification
  33. Efficiency Valuation Organisation. International Performance Measurement and Verification Protocol. Available online: https://evo-world.org/en/products-services-mainmenu-en/protocols/ipmvp (accessed on 26 January 2020).
