Weather forecasts can be improved both by blending data from multiple numerical weather prediction models and through direct edits of model data made by human forecasters.

However, whether a forecast counts as improved depends entirely on how we define “improvement” and on the precise metrics used to assess it. Furthermore, because human forecasters edit model data for many different reasons, it can be difficult to assess whether edits aimed at improving the representation of particular physical processes have been successful or not.

Two such processes, which both Australian and U.S. forecasters edit model data to better resolve, are the land-sea breeze and boundary layer mixing. These processes repeat every day and are somewhat predictable. The land-sea breeze can affect the venting of air pollution, and boundary layer mixing can affect the operation of wind turbines.

In this paper, CLEX researchers developed new metrics to assess whether forecaster edits targeting these processes were reducing error in the daily varying component of the wind forecasts, by comparing edited and unedited forecast data with weather station observations. 
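
As a rough illustration of this kind of comparison, the Python sketch below estimates the daily varying component of an hourly wind series by compositing on hour of day, and measures its error against observations. The variable names and the simple compositing approach are assumptions made for the sketch, not the paper’s actual method.

```python
import numpy as np

def diurnal_component(wind, hours):
    """Estimate the daily varying component of an hourly wind series by
    compositing on hour of day and removing the overall mean.
    (Illustrative only; the paper's decomposition may differ.)"""
    wind = np.asarray(wind, dtype=float)
    composite = np.array([wind[hours == h].mean() for h in range(24)])
    return composite - composite.mean()

def mean_abs_error(forecast, observed):
    """Mean absolute error between two 24-hour diurnal composites."""
    return np.mean(np.abs(forecast - observed))

# Hypothetical example: three days of hourly wind speed at one station.
rng = np.random.default_rng(0)
hours = np.tile(np.arange(24), 3)
obs = 2.0 * np.sin(2 * np.pi * (hours - 14) / 24) + rng.normal(0, 0.5, hours.size)
fcst = 1.5 * np.sin(2 * np.pi * (hours - 12) / 24) + rng.normal(0, 0.5, hours.size)

err = mean_abs_error(diurnal_component(fcst, hours), diurnal_component(obs, hours))
print(f"Error in the daily varying component: {err:.2f} m/s")
```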

The results showed that when winds are considered at individual stations, or averaged over small spatial scales like that of an individual city, the Bureau’s official, human-edited forecast dataset generally exhibits larger errors than unedited model data (remembering that only errors in the daily varying component of the wind forecast are being considered, not errors in the overall wind fields). Interestingly, however, the human-edited forecast can occasionally produce lower errors than the blended, ensemble-average forecast, because ensemble averaging overly smooths the daily varying component of the wind field.
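
The smoothing effect of ensemble averaging can be illustrated with a toy calculation (not taken from the paper): members that each contain a sea breeze of realistic strength, but peaking at slightly different times of day, partially cancel when averaged.

```python
import numpy as np

# Twenty hypothetical ensemble members, each a daily wind cycle of amplitude 1
# whose afternoon peak is shifted by a few hours relative to the others.
rng = np.random.default_rng(1)
t = np.arange(24)                      # hour of day
shifts = rng.normal(0, 3, 20)          # per-member timing errors (hours)
members = np.array([np.sin(2 * np.pi * (t - 15 - s) / 24) for s in shifts])

member_amp = np.mean(members.max(axis=1) - members.min(axis=1)) / 2
mean_amp = (members.mean(axis=0).max() - members.mean(axis=0).min()) / 2
print(f"typical member amplitude: {member_amp:.2f}")   # ~1.0
print(f"ensemble-mean amplitude:  {mean_amp:.2f}")     # noticeably smaller
```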

Interpreting these results requires nuance because, at small spatiotemporal scales, winds are highly chaotic, and more realistic representations of chaotic processes in forecasts do not necessarily translate into reduced errors; indeed, they often make errors worse.

You can develop some intuition for this by considering two unbiased six-sided dice, one red and one blue. Imagine the red dice represents “observations”, the blue dice represents a “forecast”, and the two are identical except for colour. If you roll the dice multiple times and compare the numbers, you can work out the average error between your red observations-dice and your blue forecast-dice (noting that the error between two numbers is just the distance between them, i.e. their difference ignoring negative signs).

Now imagine you also have a blue ball with the number 3.6 written on it, and that rolling it provides an alternative forecast to rolling your blue dice. If you roll your blue forecast-ball and red observations-dice enough times, you will find the average error is lower for the ball (about 1.5) than for the dice (about 1.94), even though the red and blue dice are identical except for colour.
Thus, the blue forecast-ball gives a lower error than the blue forecast-dice, even though the ball is not a “realistic” representation of the observations-dice at all.
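
The two averages quoted above can be checked exactly with a few lines of Python, using the definition of error as the average absolute difference:

```python
from itertools import product

faces = range(1, 7)

# Blue forecast-dice vs red observations-dice: every pair of faces is equally likely.
dice_error = sum(abs(f - o) for f, o in product(faces, faces)) / 36
print(f"dice vs dice: {dice_error:.2f}")   # 70/36, about 1.94

# Blue forecast-ball (always 3.6) vs red observations-dice.
ball_error = sum(abs(3.6 - o) for o in faces) / 6
print(f"ball vs dice: {ball_error:.2f}")   # 1.50
```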

In the case of the land-sea breeze and boundary layer mixing edits considered in this paper, forecasters may be justifiably expending effort to make the wind field more realistic, even if doing so does not decrease error. Issues like these catalysed the field of fuzzy verification, which attempts to develop novel ways of assessing forecasts of chaotic processes, such as rolling dice or the weather.

A very simple approach, known as “upscaling”, first averages data over a spatial or temporal scale at which chaotic behaviour is smoothed out. In the dice example, we might compare the average numbers rolled by each dice and by the ball. Both the blue forecast-dice and the red observations-dice roll an average of 3.5: perfect agreement.

However, the blue forecast-ball gives 3.6 every time, so its average is 3.6, and in this situation it therefore performs worse than the blue forecast-dice.
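
For illustration, the upscaled comparison can be simulated by averaging many rolls before computing the error:

```python
import numpy as np

rng = np.random.default_rng(2)
obs_rolls = rng.integers(1, 7, 100_000)    # red observations-dice
fcst_rolls = rng.integers(1, 7, 100_000)   # blue forecast-dice

print("upscaled dice error:", abs(fcst_rolls.mean() - obs_rolls.mean()))  # ~0
print("upscaled ball error:", abs(3.6 - obs_rolls.mean()))                # ~0.1
```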

In the case of the land-sea breeze and boundary layer mixing edits considered in this paper, the results show that when “upscaled” to very large spatial scales, such as that of an entire Australian state’s coastline, the edited forecast can occasionally produce lower errors in the daily varying component of the wind field than commonly used models and the Bureau’s blended ensemble product.

Another approach to fuzzy verification involves assessing qualitative “features” within a forecast. In the dice example, we might ask: how frequently is the number 1 rolled? The red and blue dice would again be in perfect agreement, each rolling a 1 one-sixth of the time. However, the blue forecast-ball will obviously never roll a 1.
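
This feature-based check can likewise be simulated, counting how often each “forecast” produces a 1:

```python
import numpy as np

rng = np.random.default_rng(3)
obs_rolls = rng.integers(1, 7, 100_000)    # red observations-dice
fcst_rolls = rng.integers(1, 7, 100_000)   # blue forecast-dice

print("observed frequency of 1s:     ", np.mean(obs_rolls == 1))   # ~1/6
print("dice-forecast frequency of 1s:", np.mean(fcst_rolls == 1))  # ~1/6
print("ball-forecast frequency of 1s:", 0.0)                       # the ball never shows 1
```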

Similarly, in this paper, methods are developed to assess qualitative features of land-sea breeze and boundary layer mixing processes, such as the average direction of the land-sea breeze at its peak, and the average time of day when this peak is attained. This allows qualitative issues with how land-sea breeze and boundary layer mixing processes are represented in forecasts to be quickly diagnosed in different regions.
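
One simple way such features could be extracted is sketched below: fit a single 24-hour harmonic to hourly composites of the wind components and read off its amplitude, the hour of its peak, and the wind direction at that peak. The function and the harmonic-fit approach are illustrative assumptions, not necessarily the method used in the paper.

```python
import numpy as np

def diurnal_harmonic(series):
    """Fit a 24-hour harmonic to a 24-element hourly composite and return
    its amplitude and the hour at which it peaks. (Illustrative only.)"""
    omega = 2 * np.pi * np.arange(24) / 24
    a = 2 * np.mean(series * np.cos(omega))
    b = 2 * np.mean(series * np.sin(omega))
    amplitude = np.hypot(a, b)
    peak_hour = (np.arctan2(b, a) * 24 / (2 * np.pi)) % 24
    return amplitude, peak_hour

# Hypothetical diurnal composites of the east-west (u) and north-south (v) wind.
hours = np.arange(24)
u = 2.0 * np.cos(2 * np.pi * (hours - 15) / 24)   # onshore flow peaking mid-afternoon
v = 0.5 * np.cos(2 * np.pi * (hours - 15) / 24)

amp_u, peak_u = diurnal_harmonic(u)
peak = int(round(peak_u))
direction = np.degrees(np.arctan2(u[peak], v[peak])) % 360
print(f"u-wind amplitude {amp_u:.1f} m/s, peaking near hour {peak_u:.0f}")
print(f"wind blowing toward {direction:.0f} degrees (clockwise from north) at the peak")
```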

For example, results indicate direction biases in the land-sea breeze around Brisbane in many commonly used models, as well as in the Bureau’s official edited forecast, and in the Bureau’s blended ensemble product, with the blended product also significantly suppressing the amplitude of the land-sea breeze at this location.

It is essential to note that all the results of this paper depend entirely on the metrics and definitions chosen, and that no claim is being made that any single metric can capture the overall accuracy, or value to the user, of a given forecast. Whether we prefer a forecast-dice or a forecast-ball depends on the application.

Future work could extend these ideas by considering how specific industries or groups consume Bureau forecasts, and by designing verification metrics tailored to their specific needs.