Testing the SUPERFLEX Modelling Framework for basin classification purposes: an analysis of performance measures.
Abstract:
Recently, Kavetski & Fenicia (2011) and Fenicia et al. (2011; 2013) demonstrated the potential of flexible model structures for hypothesis testing. They showed that basins with distinct hydrological dynamics are best characterized by distinctly different lumped model structures. Flexible hydrological models can therefore serve as tools to elucidate the potential correspondence between basin structure and model structure with a view to basin classification. If we regard each model as an idiosyncratic classification algorithm, the flexible hydrological models act as basin classifiers. Although basin classification makes use of hydrological models, the use of flexible model structures as basin classifiers is new. The SUPERFLEX Modelling Framework (Kavetski & Fenicia, 2011; Fenicia et al., 2011) supports serial, linear and parallel structures with different numbers of reservoirs and parameters. It covers a relatively broad range of conceptual model complexities, starting with a simple one-reservoir structure and gradually increasing in complexity by adding reservoirs, power functions and lag functions.
Using several flexible model structures as basin classifiers requires identifying a best-performing model for each catchment. However, there is no ideal performance measure, and a basin classification based on best-performing models is directly tied to the quality and explanatory power of the performance measure used. The suitability of frequently used performance measures for identifying a best-performing model for single basins therefore needs to be tested. Measures derived from the Flow Duration Curve (FDC) can form an alternative to the classical ones.
The FDC is the complement of the cumulative distribution function of streamflow. The concordance between an observed and a simulated FDC is a powerful measure of modelling performance. Yilmaz et al. (2008) propose comparing simulated and observed runoff time series with a set of signature indices, each quantifying the agreement between observed and simulated FDCs over a defined segment of the curve. We use four signature indices covering the whole range of the FDC from very high to low flow.
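As a brief illustration of the concept, an empirical FDC can be obtained by sorting the flows in descending order and assigning each an exceedance probability. The sketch below assumes the common Weibull plotting position (rank / (n + 1)); the abstract does not state which variant was used in the study.

```python
import numpy as np

def flow_duration_curve(q):
    """Return (exceedance probability, flow) pairs for a flow series.

    Assumption: Weibull plotting position rank / (n + 1); the study may
    use a different plotting-position formula.
    """
    q = np.sort(np.asarray(q, dtype=float))[::-1]  # descending: highest flow first
    ranks = np.arange(1, q.size + 1)
    p_exc = ranks / (q.size + 1)                   # exceedance probability per flow
    return p_exc, q
```

For example, `flow_duration_curve([3, 1, 2, 5, 4])` returns the flows sorted as `[5, 4, 3, 2, 1]` with exceedance probabilities from 1/6 up to 5/6.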
The aim of this study is to test which performance measures are suitable for identifying a best-performing model for single basins out of 12 models of the SUPERFLEX Modelling Framework (Kavetski & Fenicia, 2011; Fenicia et al., 2013).
In order to find effective measures, we follow a three-step approach:
- Calibrate all 10 basins with the 12 models of the SUPERFLEX Modelling Framework
- Calculate various performance measures for all model realizations (i.e. 120 calibrated models)
- Analyze the performance measures per basin and per model with respect to redundancy, explanatory power, clarity and usability.
We base our case study on 10 small to medium-sized gauged basins in Rhineland-Palatinate, Germany, covering different landscapes. Basin sizes vary from 39 km² to 681 km². Elevation ranges from approx. 100 m a.s.l. in the Rhine valley up to 818 m a.s.l. in the Hunsrueck low mountain range. All basins are rural with little urbanization.
For the data-driven modelling, hourly runoff, areal precipitation and temperature data are available for the period January 1996 to December 2003. These time series cover a wide range of annual and seasonal precipitation and runoff events.
Methods
The comparison of observed and simulated runoff time series by means of a performance measure allows the identification of a best-performing model for a basin. However, there are many different kinds of performance measures, each with its own strengths, weaknesses and sensitivities to different parts of the hydrograph. Here we test three types of performance measures:
1. Classical statistical performance measures:
- Root Mean Square Error
- Pearson R²
- Weighted R²
- Spearman's rank correlation coefficient
2. Hydrological performance measures:
- Nash Sutcliffe Efficiency
- Modified NS Efficiency
- Index of Agreement
- Modified Index of Agreement
- Kling-Gupta-Efficiency
- Volumetric Efficiency
3. Signature Indices from the FDC:
- FHV: very high flow, flow exceedance probability < 2%
- FMV: high flow, flow exceedance probability 2-20%
- FMS: slope of the mid-segment of the FDC, flow exceedance probability 20-70%
- FLV: low flow, flow exceedance probability > 70%
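Two of the hydrological measures listed above can be sketched briefly; this is an illustrative implementation of the standard formulas, not the code used in the study.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 equals the mean benchmark."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta Efficiency (2009 form): combines correlation, variability and bias."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Both measures equal 1 for a perfect simulation; NSE drops to 0 when the simulation is no better than the observed mean.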
The classical statistical and hydrological performance measures compare the entire observed and simulated hydrographs and mainly describe the overall performance of a model. Depending on their calculation formula, they are often sensitive to specific parts of the hydrograph. By contrast, the signature indices evaluate segments of the FDC and express the biases between observed and simulated FDCs.
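The segment-wise comparison of FDCs can be sketched as percent biases over the exceedance-probability bands given above. This follows the spirit of Yilmaz et al. (2008), but the exact formulations (e.g. whether FMS is evaluated as a log-slope bias and FLV as a simple volume bias) are assumptions of this sketch, not definitions from the study.

```python
import numpy as np

def fdc(q):
    """Sorted flows (descending) with Weibull exceedance probabilities (assumed)."""
    q = np.sort(np.asarray(q, dtype=float))[::-1]
    p = np.arange(1, q.size + 1) / (q.size + 1)
    return p, q

def percent_bias(sim, obs):
    return 100.0 * (np.sum(sim) - np.sum(obs)) / np.sum(obs)

def signature_biases(q_obs, q_sim):
    """Percent biases for the four FDC segments used in this study (sketch)."""
    p_o, fo = fdc(q_obs)
    p_s, fs = fdc(q_sim)
    seg = lambda p, q, lo, hi: q[(p >= lo) & (p < hi)]
    fhv = percent_bias(seg(p_s, fs, 0.00, 0.02), seg(p_o, fo, 0.00, 0.02))
    fmv = percent_bias(seg(p_s, fs, 0.02, 0.20), seg(p_o, fo, 0.02, 0.20))
    # Mid-segment slope: log-flow difference between the 20% and 70% exceedance
    # quantiles (i.e. the 80th and 30th percentiles of flow). Assumed form.
    slope = lambda q: np.log(np.quantile(q, 0.80)) - np.log(np.quantile(q, 0.30))
    fms = 100.0 * (slope(q_sim) - slope(q_obs)) / slope(q_obs)
    flv = percent_bias(seg(p_s, fs, 0.70, 1.01), seg(p_o, fo, 0.70, 1.01))
    return {"FHV": fhv, "FMV": fmv, "FMS": fms, "FLV": flv}
```

A perfect simulation yields zero bias in every segment; a uniformly overestimating simulation yields positive FHV, FMV and FLV.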
Results
The classical and hydrological performance measures show related patterns between model performance and model structure for models 1 to 12. For all studied basins, models 1, 2, 8, 9 and 10 perform poorly, while models 4, 5, 6, 7, 11 and 12 perform better overall. Fig. 1a depicts this pattern of model performance for one basin; the patterns for the other basins look very similar. Models 5 and 6 share the model concept of model 4, and model 12 that of model 11, each with additional features. These extra features are not decisive for some basins and therefore often lead to similar simulations. Only model 3 shows both good and poor performance depending on basin properties: basins with less precipitation mostly yield poor performance for model 3, whereas for comparatively wet basins its performance is good. The classical statistical and the hydrological performance measures lead to no substantially different conclusions.
For most of the basins, several of the better-performing models have almost identical values for the various performance measures. Frequently we obtain similar performances (differences < 0.05) for models 3, 4, 5, 6, 7, 11 and 12. With respect to the well-performing models, there are no substantial differences between the various performance measures: none of them can clearly identify a best-performing model per basin, although the models have different concepts and structures. Moreover, similar performance values for different models do not necessarily describe similar simulations. These effects obstruct the identification of a decidedly best-performing model for a single basin when the choice is based on classical statistical or hydrological performance measures.
The signature indices from the FDC compare parts of the hydrograph that have a defined probability of occurrence. Like the other performance measures, they indicate better agreement for models 3, 4, 5, 6, 7, 11 or 12 than for the other models. However, they differentiate between model simulations with similar statistical or hydrological performance values. Each index yields a distinct value that reveals differences in performance for parts of the FDC and therefore of the hydrograph. These values evaluate the performance for parts of the hydrograph or, when combined, for the whole simulation.
For the studied basins, no model structure covers the entire FDC appropriately. The signature indices show large differences between and within models, ranging from very high to acceptable bias values. Combining acceptable signature-index values and comparing them with the values from other models allows a clear identification of the best-performing model, either through overall lower biases or through sums of (weighted) biases.
Fig. 1a depicts, for all performance measures, an example with good performance for models 3, 4, 5, 6, 7, 11 and 12. Because the values per measure are mostly similar for these models, it is hard to decide which model performs best. The FDCs (Fig. 1b) indicate different simulations for each model. The signature indices express the degree of deviation and clarify the quality of the simulations. Model 7 has the best signature indices except for FLV, and the good FLV performance of model 11 cannot offset its large values for FHV and FMV. A combination of the signature indices of these four models shows the lowest values for model 7 for most parts of the FDC. Therefore, model 7 is the best-performing model for this basin.
The classical statistical and hydrological performance measures also identify model 7 as the best-performing model. Of these eight measures, two show the same performance for models 5, 6 and 12 as for model 7. For all performance measures, at least two further models come within 0.05 of the best-performing model.
Conclusion
This study shows the importance of an appropriate selection of performance measures for identifying a best-performing model. In contrast to the classical statistical and hydrological performance measures, the signature indices make it possible to identify a definite best-performing model, which is the main objective of this study. Furthermore, signature indices are indicative of the error type and allow extensive insight into the discrepancies between observed and simulated runoff. This is useful for studies focusing on specific parts of the hydrograph.
Since FDCs do not consider timing, the signature values are not indicative of timing errors. A generally good simulation with a timing error will therefore still obtain good signature indices, whereas the other performance measures would indicate poor performance.
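This timing insensitivity can be demonstrated directly: shifting a hydrograph in time leaves its FDC, and hence any FDC-based signature, unchanged, while a squared-error measure such as NSE degrades. A minimal synthetic example, assuming a simple circular shift:

```python
import numpy as np

# Synthetic periodic hydrograph and a time-shifted copy of it
q_obs = np.sin(np.linspace(0, 8 * np.pi, 400)) ** 2 + 0.1
q_sim = np.roll(q_obs, 12)  # identical flow values, shifted in time

# The sorted flows (i.e. the FDCs) are identical ...
fdc_equal = np.allclose(np.sort(q_obs), np.sort(q_sim))

# ... while NSE is clearly degraded by the timing error.
nse_shifted = 1.0 - np.sum((q_sim - q_obs) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)
```

Here `fdc_equal` is true while `nse_shifted` falls well below 1, so FDC-based signatures alone cannot reveal the timing error.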
However, the evaluation of overall performance is more complex, especially when single indices display different performances for different parts of the FDC. To overcome this problem we need to develop a combined measure from the signature indices. Such a combination leads to a numerical performance measure with well-differentiated values, which simplifies the identification of a best-performing model, especially when a large number of basins is involved. Finally, the analysis of signature biases is a helpful method for further model development. Moreover, it would be helpful to identify spatial patterns and basin properties of best-performing models based on the proposed combined signature index.
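A combined measure of this kind could, for instance, be a weighted sum of absolute signature biases, with the best-performing model taken as the one with the lowest value. The equal default weights below are purely an illustrative assumption; the actual weighting remains an open design choice.

```python
def combined_signature_index(biases, weights=None):
    """Weighted sum of absolute signature biases (lower is better).

    `biases` maps index names (e.g. "FHV") to percent biases; equal
    weights are assumed by default -- hypothetical, not from the study.
    """
    if weights is None:
        weights = {name: 1.0 for name in biases}
    return sum(weights[name] * abs(value) for name, value in biases.items())
```

For example, `combined_signature_index({"FHV": -10.0, "FLV": 5.0})` gives 15.0, and choosing the model with the minimum combined index resolves ties that single measures leave open.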
The SUPERFLEX Modelling Framework offers a sound basis for basin classification with flexible model structures, provided appropriate performance measures are defined. To identify a best-performing model for a single basin, we recommend the use of signature indices based on the bias between observed and simulated runoff.
References
Fenicia F., Kavetski D., Savenije H.H.G.: Elements of a flexible approach for conceptual hydrological modeling: 1. Motivation and theoretical development. Water Resources Research, 47, 2011.
Fenicia F., Kavetski D., Savenije H.H.G., Clark M.P., Schoups G., Pfister L., Freer J.: Catchment properties, function, and conceptual model representation: is there a correspondence? Hydrological Processes, 2013.
Kavetski D., Fenicia F.: Elements of a flexible approach for conceptual hydrological modeling: 2. Application and experimental insights. Water Resources Research, 47, 2011.
Yilmaz K., Gupta H.V., Wagener T.: A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model. Water Resources Research, 44(9), 2008.