Normalized metered energy consumption (NMEC) methods offer many benefits to energy efficiency utility programs and their participants, but they must be applied with care. NMEC programs rely on baseline models for pre-screening and for calculating energy savings after implementation. The results of these steps determine program eligibility and the size of a participant's incentive check, so it is vital that the methods are used with good judgement and an understanding of the statistical principles and key model metrics behind them.
The example presented below demonstrates that current technical guidance should serve only as guidance, not as criteria that determine whether a site "passes or fails" the NMEC screening process. In this post, I will walk you through an example building dataset where the goodness-of-fit thresholds for three key model metrics are met, yet the baseline model is not sufficient to ensure fair and equitable project outcomes for all stakeholders. Focusing too narrowly on key model metrics, without accounting for the overall context, can lead to poorly performing projects and missed opportunities, especially in pay-for-performance programs.
If you’re curious about the plots presented below, check out the interactive environment where you can review the code used to generate them.
Background of NMEC in Energy Efficiency Programs
Typical (non-NMEC) efficiency programs pay incentives based on estimated or deemed energy savings values, whereas NMEC programs offer pay-for-performance incentives. This creates an undeniable incentive for NMEC program participants to maximize actual energy savings. However, since NMEC baseline models drive both project eligibility and energy savings calculations, it is imperative that the models be as accurate as possible.
To participate in an NMEC program, a building’s energy use must be predictable, within an acceptable degree of certainty. During pre-screening for a site-level NMEC program, a predictability analysis is performed wherein candidate baseline models are screened and vetted to ensure they can adequately characterize a site’s energy use. For buildings with regular operation, like grocery stores or university campus buildings, this can usually be accomplished. However, if models cannot characterize building energy use with enough accuracy, they cannot participate in the program.
Site Screening
The screening process relies on three key model metrics:
CV(RMSE): measure of random error between the model fit and the actual data
NMBE: measure of the total bias error between the model fit and the actual data
R2: measure of the amount of variation in the energy use explained by the model
The NMEC Rulebook references the requirements listed in the LBNL Technical Guidance to demonstrate the feasibility of NMEC approaches on target buildings. The guidance provides the following three thresholds for model goodness-of-fit (a short sketch of how these statistics are computed follows the list):
CV(RMSE) < 25%
NMBE between -0.5% and +0.5%
R2 > 0.7
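For reference, here is a minimal sketch of how these three statistics can be computed from measured and predicted energy use. The function name, the degrees-of-freedom convention, and the example variable names are my own assumptions rather than anything prescribed by the guidance, so treat this as an illustration rather than a normative formula.

```python
import numpy as np

def goodness_of_fit(actual, predicted, n_params):
    """Compute CV(RMSE), NMBE, and R2 for a fitted baseline model.

    actual, predicted : measured and modeled energy use (same length)
    n_params          : number of model parameters, used for degrees of freedom
                        (conventions vary; some definitions simply use n)
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = len(actual)
    residuals = actual - predicted
    mean_use = actual.mean()

    # CV(RMSE): root-mean-square error normalized by mean energy use
    cv_rmse = np.sqrt(np.sum(residuals ** 2) / (n - n_params)) / mean_use

    # NMBE: total bias error normalized by mean energy use
    nmbe = np.sum(residuals) / ((n - n_params) * mean_use)

    # R2: fraction of the variation in energy use explained by the model
    r2 = 1 - np.sum(residuals ** 2) / np.sum((actual - mean_use) ** 2)

    return {"CV(RMSE)": cv_rmse, "NMBE": nmbe, "R2": r2}

# Hypothetical screening check against the thresholds above:
# fit = goodness_of_fit(actual_kwh, predicted_kwh, n_params=4)
# passes = fit["CV(RMSE)"] < 0.25 and abs(fit["NMBE"]) < 0.005 and fit["R2"] > 0.7
```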
As with almost all thresholds and magic numbers, these criteria have taken on a life of their own and are often referenced and evaluated without context when screening projects for NMEC programs. This issue is of prime importance in pay-for-performance programs, where the implementers' payments depend on how accurately these baseline models characterize the building's energy use.
Predictability Analysis
A predictability analysis usually begins by charting the energy use against time and temperature to visualize energy use patterns and identify major non-routine events that would make the building clearly ineligible for an NMEC program.
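The actual plotting code is available in the binder environment linked above; the sketch below only illustrates the general approach, and assumes the interval data lives in a CSV with hypothetical `timestamp`, `kwh`, and `temp_f` columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: interval meter data paired with outdoor air temperature
df = pd.read_csv("meter_data.csv", parse_dates=["timestamp"])

# Time series: energy use on the left axis, temperature on the right axis
fig, ax_energy = plt.subplots(figsize=(12, 4))
ax_temp = ax_energy.twinx()
ax_energy.plot(df["timestamp"], df["kwh"], color="green")
ax_temp.plot(df["timestamp"], df["temp_f"], color="blue", alpha=0.5)
ax_energy.set_ylabel("Energy use (kWh)")
ax_temp.set_ylabel("Outdoor air temperature (°F)")
plt.show()

# Scatter: energy use against temperature
plt.scatter(df["temp_f"], df["kwh"], s=5, alpha=0.4)
plt.xlabel("Outdoor air temperature (°F)")
plt.ylabel("Energy use (kWh)")
plt.show()
```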
Here we have energy use data from a commercial facility in southern California:
The temperature data is plotted in blue on the right-axis, and the energy use data is plotted in green on the left-axis. The two data streams are plotted as a scatterplot below:
Here is a zoomed-in version of the scatterplot:
Given the shape of the scatter plot, a four-parameter model seems appropriate. However, as seen in the time-series chart, the energy use is also dependent on the time of use: weekday vs weekend.
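For readers unfamiliar with change-point models: a four-parameter (4P) model fits a base load, a change-point temperature, and separate slopes below and above that change point. The sketch below is one simplified way to express such a model with scipy's curve fitting; it is not the exact implementation used to produce the results in this post.

```python
import numpy as np
from scipy.optimize import curve_fit

def four_parameter_model(temp, base_load, slope_left, slope_right, change_point):
    """4P change-point model: a base load plus separate linear slopes
    below and above a change-point temperature."""
    below = np.minimum(temp - change_point, 0.0)  # negative when temp < change point
    above = np.maximum(temp - change_point, 0.0)  # positive when temp > change point
    return base_load + slope_left * below + slope_right * above

# Hypothetical fit (temp_f and kwh are assumed arrays of temperature and energy use):
# p0 = [kwh.mean(), 0.0, 1.0, 65.0]  # rough starting guesses, incl. a 65 °F change point
# params, _ = curve_fit(four_parameter_model, temp_f, kwh, p0=p0)
# predicted = four_parameter_model(temp_f, *params)
```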
In the next section, the time-of-week & temperature and the four-parameter algorithms are assessed for this data.
Candidate Model Screening
| Algorithm | R2 | Adj. R2 | CV(RMSE) % | NDBE % | NMBE % |
| --- | --- | --- | --- | --- | --- |
| 4P | 0.78 | 0.78 | 9.09 | -2.41e-16 | -2.43e-16 |
| TOWT | 0.80 | 0.79 | 8.67 | -2.75e-14 | -2.85e-14 |
As per the three thresholds, both models demonstrate feasibility for an NMEC approach. The R2 value is above 0.7, the CV(RMSE) is well below 25%, and NMBE is near zero.
The guidance, however, also notes that “…these (thresholds) do not comprise pass/fail criteria, but rather an analysis that will feed into an interpretation of model suitability”.
This statement is very important, but it is overshadowed by the threshold criteria, which serve as heuristics and are often used in place of a comprehensive analysis rather than alongside it. A comprehensive analysis involves looking beyond the minimum 12 months of data and assessing the complete energy use behavior of the site. Limiting our focus to 12 months leaves us oblivious to slow-moving overall trends or non-routine events (NREs) outside the 12-month period.
One such NRE is the onset of the COVID-19 pandemic. The shelter-in-place restrictions first went into effect in California on March 19th, 2020. The dataset we have here falls entirely within the shutdown period, and it is hard for me to imagine a facility that operated the same way during the pandemic as it did prior to the shutdown.
A well-known and widely trusted way to check for this behavior is to conduct model testing on a dataset different from the one used to develop and train the model. We take another 12 months of data from the same building and test the model’s ability to predict the energy use on this “new” dataset. The only requirement of this methodology is that the building operations and energy use should be similar between the two datasets. In this post, I’ve chosen an additional 12 months of data to test the model. The testing period’s length is context driven and may be lengthened or shortened based on data availability and other factors. The Efficiency Valuation Organization (EVO) uses this methodology[1] to evaluate the accuracy of M&V tools against the benchmarked public-domain tools.
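In code, this check is simply a train/test split along the time axis: fit the model on one 12-month window and score it on a different one. Here is a minimal sketch using the periods described in the next section, reusing the hypothetical `df`, `four_parameter_model`, and `goodness_of_fit` objects from the earlier sketches and assuming the DataFrame covers both years of data.

```python
# Training window (used to fit the model) and a separate testing window
train = df[(df["timestamp"] >= "2020-04-01") & (df["timestamp"] < "2021-04-01")]
test = df[(df["timestamp"] >= "2019-04-01") & (df["timestamp"] < "2020-04-01")]

# Fit the 4P model on the training window only
p0 = [train["kwh"].mean(), 0.0, 1.0, 65.0]
params, _ = curve_fit(four_parameter_model, train["temp_f"], train["kwh"], p0=p0)

# Score it on the held-out window the model has never seen
test_pred = four_parameter_model(test["temp_f"].to_numpy(), *params)
print(goodness_of_fit(test["kwh"], test_pred, n_params=4))
```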
In the next section, I apply this process to the candidate models. I have another year of data which I can use to assess the model’s skill in capturing the underlying trend of the site’s energy consumption profile:
Expanded Candidate Model Screening (Predictive Accuracy)
The models were built on the April 2020 – April 2021 dataset. The NMEC Rulebook requires that the time between the end of the baseline period and the completion of the project implementation stage not exceed 18 months. To stay within this window, I am using the data from April 2019 – April 2020 to evaluate the models' predictive accuracy. The scatter plot below shows the actual energy use in green and the model predictions in red and blue.
Both models’ predictions are much lower than the actual energy use of the building prior to the COVID-19 related shutdowns. This result is expected as most buildings emptied out during the pandemic and consequently, had much lower energy use. As the economy reopens, these buildings will be reoccupied and energy use will go back up.
If we were to use the models built on the 2020–2021 dataset, the adjusted baseline may be much lower than the actual energy use after project implementation, and the project would appear to have negative savings. Narrowly focusing on the three thresholds would, therefore, lead to poor outcomes for pay-for-performance programs and large losses for the implementers.
Additional Independent Variables
After the onset of COVID-19, building occupancy has become as important a predictor as time-of-week and temperature in determining the general trend of energy use profiles. While not all buildings have the luxury of tracking occupancy through key-card swipes, wi-fi connections, or occupancy sensors, those that do have an easier path to solving this problem.
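As a rough sketch of what this looks like in practice, an occupancy series can be added as one more regressor alongside temperature. The example below simply bolts a linear occupancy term onto the hypothetical 4P function from earlier; the `occupancy` column and the model form are illustrative assumptions, not the exact specification behind the results that follow.

```python
def four_parameter_with_occupancy(X, base_load, slope_left, slope_right,
                                  change_point, occ_coef):
    """4P change-point model plus a linear occupancy term (illustrative only)."""
    temp, occupancy = X
    return (four_parameter_model(temp, base_load, slope_left, slope_right, change_point)
            + occ_coef * occupancy)

# Hypothetical fit, assuming an 'occupancy' column (e.g., fraction of badge-ins):
# p0 = [train["kwh"].mean(), 0.0, 1.0, 65.0, 10.0]
# params, _ = curve_fit(four_parameter_with_occupancy,
#                       (train["temp_f"].to_numpy(), train["occupancy"].to_numpy()),
#                       train["kwh"], p0=p0)
```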
The prediction errors for the four-parameter model and the TOWT model above were 12.92% and 12.23%, respectively. After adding occupancy data to the models, the prediction errors were reduced to 4.25% and 4.07%, respectively. The scatter plot below shows the updated predictions:
And the updated goodness-of-fit metrics are:
| Algorithm | R2 | Adj. R2 | CV(RMSE) % | NDBE % | NMBE % |
| --- | --- | --- | --- | --- | --- |
| 4P + occupancy | 0.86 | 0.86 | 7.62 | -9.06e-17 | -9.17e-17 |
| TOWT + occupancy | 0.88 | 0.87 | 7.23 | -3.92e-14 | -4.09e-14 |
Note that the two models have almost identical goodness-of-fit metrics. The occupancy variable provides information to the model that the TOWT algorithm otherwise tries to glean from the data internally. In this scenario, I would go with the simpler algorithm to develop the baseline model for this site's M&V Plan.
Summary
Testing a model’s ability to predict data that it has not seen before is analogous to sitting for an exam. You can get the top score on an assessment if you are free to look through the answer key. Similarly, a model can perform very well on the dataset it is trained on and present low goodness-of-fit metrics. This is especially true for the advanced machine-learning algorithms that are built to minimize the model error. If left unchecked, many of these advanced algorithms begin to model the noise in the data and can present a false sense of high accuracy to the analyst. The technical term for this behavior is over-fitting.
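A toy illustration of over-fitting (not from the original analysis): fit an overly flexible model to noisy data, then compare its fit on the training data with its fit on a fresh sample from the same underlying process.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
signal = np.sin(2 * np.pi * x)
y_train = signal + rng.normal(0, 0.4, x.size)  # training sample
y_test = signal + rng.normal(0, 0.4, x.size)   # fresh sample, same process

# An overly flexible model (degree-15 polynomial) starts modeling the noise
overfit = Polynomial.fit(x, y_train, deg=15)

def r2(y, pred):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print("Training R2:", r2(y_train, overfit(x)))  # looks strong
print("Testing R2: ", r2(y_test, overfit(x)))   # noticeably worse
```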
Your knowledge of the subject matter is truly tested when you take the exam without the answers readily available. Similarly, a model's ability to produce reliable predictions, which can be used to correctly quantify project savings, is truly assessed on a new dataset that is different from the one the model was trained on, but similar enough that the data profiles are consistent with the training dataset. After all, you want the exam to cover the material that you studied.
A robust model is one that is built using all influential data variables and is able to produce reliable predictions on an 'unseen' dataset. Limiting model evaluation to the three goodness-of-fit metrics corners us into an uncomfortable spot: a false sense of security and high exposure to financial as well as environmental risk.
Get Interactive with the Code!
Remember, if you’re interested in learning more about the code used to generate the plots presented in this post, go to this interactive environment on binder.
[1] https://mvportal.evo-world.org/