Model Selection on Solid Ground: Comparison of Techniques to Evaluate Bayesian Evidence

Monday, 15 December 2014: 11:20 AM
Wolfgang Nowak, University of Stuttgart, Stuttgart, Germany, Anneli Schöniger, University of Tübingen, Tübingen, Germany, Luis E Samaniego, Helmholtz Centre for Environmental Research UFZ Leipzig, Leipzig, Germany and Thomas Wöhling, University of Tübingen, Water & Earth System Science Competence Cluster (WESS), Tübingen, Germany
Bayesian model averaging (BMA) ranks and averages a set of plausible, competing models based on their fit to the available data and on their complexity. BMA requires the Bayesian model evidence (BME), which is the likelihood of the observed data integrated over each model's parameter space. Evaluating the BME integral is highly challenging because its dimension equals the number of model parameters. Three classes of techniques are available to evaluate BME, each with its own challenges and limitations:
  1. Exact analytical solutions are fast, but restricted by strong assumptions.
  2. Brute-force numerical evaluation is accurate, but quickly becomes computationally infeasible.
  3. Approximations in the form of information criteria (AIC, BIC, KIC) are known to yield contradictory model rankings.
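
For orientation, the quantities named above can be written out as follows. The notation is an assumption made here for illustration and is not taken from the abstract: D denotes the observed data, M_k the k-th model, and \theta_k its parameter vector.

```latex
% BME of model M_k: the likelihood integrated over the prior of the parameters
p(D \mid M_k) = \int p(D \mid M_k, \theta_k)\, p(\theta_k \mid M_k)\, \mathrm{d}\theta_k

% Posterior model weights used by BMA to rank and average the models
P(M_k \mid D) = \frac{p(D \mid M_k)\, P(M_k)}{\sum_{j} p(D \mid M_j)\, P(M_j)}
```

The integral runs over as many dimensions as \theta_k has entries, which is what makes direct evaluation expensive.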

We conduct a systematic comparison of available techniques to evaluate BME, including several numerical schemes. We highlight their common features and differences, and investigate their computational effort and accuracy. For accuracy, we investigate the impact of (a) data set size and (b) overlap between the prior and the likelihood. We use a synthetic example with an exact analytical solution (as a first-time validation against a true solution), and a real-world hydrological application, where we use a brute-force Monte Carlo method as the benchmark solution.
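
The idea of validating a brute-force estimate against an exact solution can be illustrated with a minimal sketch. It assumes a hypothetical linear model with a Gaussian prior and Gaussian noise of known variance, for which the BME is available in closed form. The design matrix, prior, noise level, sample size and function names are illustrative choices, not the synthetic example used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian test case (all values are illustrative assumptions)
n_obs, n_par = 10, 2
X = np.column_stack([np.ones(n_obs), np.linspace(0.0, 1.0, n_obs)])
prior_mean = np.zeros(n_par)
prior_cov = np.eye(n_par)
noise_var = 0.25
y = X @ np.array([0.5, -1.0]) + rng.normal(0.0, np.sqrt(noise_var), n_obs)


def log_bme_analytical():
    """Exact log-BME: marginally, y ~ N(X mu, X C X^T + noise_var * I)."""
    resid = y - X @ prior_mean
    cov = X @ prior_cov @ X.T + noise_var * np.eye(n_obs)
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n_obs * np.log(2.0 * np.pi) + logdet + quad)


def log_bme_monte_carlo(n_samples):
    """Brute-force log-BME: average the likelihood over samples from the prior."""
    thetas = rng.multivariate_normal(prior_mean, prior_cov, size=n_samples)
    resid = y - thetas @ X.T  # shape (n_samples, n_obs)
    log_like = -0.5 * (n_obs * np.log(2.0 * np.pi * noise_var)
                       + np.sum(resid**2, axis=1) / noise_var)
    peak = log_like.max()  # shift before exponentiating to avoid underflow
    return peak + np.log(np.mean(np.exp(log_like - peak)))


print("analytical  log-BME:", log_bme_analytical())
print("Monte Carlo log-BME:", log_bme_monte_carlo(200_000))
```

With only two parameters the prior samples cover the high-likelihood region well. As the number of parameters grows or the likelihood becomes narrow relative to the prior, far more samples are needed for the same accuracy, which is the computational limitation of brute-force evaluation noted above.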

Our results show that the ICs differ drastically in their quality of approximation. Among the ICs, the KIC evaluated at the MAP performs best, but in general none of them is satisfactory for non-linear model problems. Since the ICs share the goodness-of-fit term, the observed differences imply an inaccurate penalty for model complexity. Our findings indicate that the choice of approximation method substantially influences the accuracy of the BME estimate and, consequently, the final model ranking and BMA results.
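
To make the comparison of criteria concrete, the sketch below computes AIC, BIC and a KIC evaluated at the MAP for a hypothetical linear-Gaussian model and sets them against the exact value of -2 ln BME. The abstract does not state formulas, so the commonly used textbook forms are assumed here: AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N at the maximum-likelihood estimate, and KIC at the MAP as the Laplace approximation of -2 ln BME. The model, prior and noise level are illustrative assumptions, and the noise variance is treated as known, so k counts only the regression parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic-trend model with Gaussian prior and known noise variance
n_obs, n_par = 20, 3
t = np.linspace(0.0, 1.0, n_obs)
X = np.column_stack([np.ones(n_obs), t, t**2])
prior_mean = np.zeros(n_par)
prior_cov = 4.0 * np.eye(n_par)
noise_var = 0.1
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, np.sqrt(noise_var), n_obs)


def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate normal distribution."""
    resid = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)


def log_like(theta):
    return gauss_logpdf(y, X @ theta, noise_var * np.eye(n_obs))


# AIC and BIC: goodness of fit at the maximum-likelihood estimate plus a penalty
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]
aic = -2.0 * log_like(theta_mle) + 2.0 * n_par
bic = -2.0 * log_like(theta_mle) + n_par * np.log(n_obs)

# KIC at the MAP: Laplace approximation of -2 ln BME around the posterior mode
prior_prec = np.linalg.inv(prior_cov)
post_cov = np.linalg.inv(X.T @ X / noise_var + prior_prec)
theta_map = post_cov @ (X.T @ y / noise_var + prior_prec @ prior_mean)
_, logdet_post = np.linalg.slogdet(post_cov)
kic_map = (-2.0 * log_like(theta_map)
           - 2.0 * gauss_logpdf(theta_map, prior_mean, prior_cov)
           - n_par * np.log(2.0 * np.pi)
           - logdet_post)

# Exact reference value, available because the model is linear-Gaussian
minus2_log_bme = -2.0 * gauss_logpdf(
    y, X @ prior_mean, X @ prior_cov @ X.T + noise_var * np.eye(n_obs))

for name, value in [("AIC", aic), ("BIC", bic), ("KIC@MAP", kic_map),
                    ("-2 ln BME", minus2_log_bme)]:
    print(f"{name:>10s}: {value:.4f}")
```

In this linear-Gaussian setting the posterior is itself Gaussian, so the Laplace approximation is exact and KIC at the MAP coincides with -2 ln BME up to rounding. That no longer holds for non-linear models, which is the setting in which the abstract reports that no criterion is satisfactory.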