*This is part 4 of a series of introductory posts about the principles of climate modelling. Others in the series: 1 | 2 | 3*

In the previous post I said there will always be limits to our scientific understanding and computing power, which means that “all models are wrong.” But it’s not as pessimistic as this quote from George Box seems, because there’s a second half: “… but some are useful.” A model doesn’t have to be perfect to be useful. The hard part is assessing whether a model is a good tool for the job. So the question for this post is:

*How do we assess the usefulness of a climate model?*

I’ll begin with another question: what does a spam (junk email) filter have in common with state-of-the-art predictions of climate change?

The answer is they both improve with “Bayesian learning”. Here is a photo of the grave of the Reverend Thomas Bayes, which I took after a meeting at the Royal Statistical Society (gratuitous plug of our related new book, “Risk and Uncertainty Assessment for Natural Hazards”):

Bayesian learning starts with a first guess of a probability. A junk email filter has a first guess of the probability of whether an email is spam or not, based on keywords I won’t repeat here. Then you make some observations, by clicking “Junk” or “Not Junk” for different emails. The filter combines the observations with the first guess to make a better prediction. Over time, a spam filter gets better at predicting the probability that an email is spam: it learns.

The filter combines the first guess and observations using a simple mathematical equation called Bayes’ theorem. This describes how you calculate a “conditional probability”, a probability of one thing given something else. Here this is the probability that a new email is spam, given your observations of previous emails. The initial guess is called the “prior” (first) probability, and the new guess after comparing with observations is called the “posterior” (afterwards) probability.
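As a minimal sketch of that update (not how any real spam filter is implemented), Bayes’ theorem for a single keyword says: P(spam | keyword) = P(keyword | spam) × P(spam) / P(keyword). The function name `bayes_update` and all the numbers below are assumptions made up for illustration.

```python
def bayes_update(prior, likelihood_if_spam, likelihood_if_not_spam):
    """Return the posterior P(spam | evidence) using Bayes' theorem."""
    # Total probability of seeing the evidence, whether or not the email is spam.
    evidence = likelihood_if_spam * prior + likelihood_if_not_spam * (1.0 - prior)
    return likelihood_if_spam * prior / evidence


# Hypothetical numbers, chosen only for illustration.
prior = 0.5              # first guess: half of all email is spam
p_word_given_spam = 0.8  # the keyword appears in 80% of spam
p_word_given_ham = 0.1   # ... and in 10% of legitimate email

posterior = bayes_update(prior, p_word_given_spam, p_word_given_ham)
print(f"P(spam | keyword) = {posterior:.3f}")

# "Learning": the posterior becomes the prior for the next piece of evidence.
posterior_after_second = bayes_update(posterior, p_word_given_spam, p_word_given_ham)
print(f"After a second keyword: {posterior_after_second:.3f}")
```

Each new observation pushes the probability further from the first guess, which is the sense in which the filter learns.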

The same equation is used in many state-of-the-art climate predictions. We use a climate model to make a first guess at the probability of future temperature changes. One of the most common approaches for this is to make predictions using many different plausible values of the model parameters (control dials): each “version” of the model gives a slightly different prediction, which we count up to make a probability distribution. Ideally we would compare this initial guess with observations, but unfortunately these aren’t available without (a) waiting a long time, or (b) inventing a time machine. Instead, we also use the climate model to “predict” something we already know, to make a first guess at the probability of something in the past, such as temperature changes from the year 1850 to the present. All the predictions of the future have a twin “prediction of the past”.

We take observations of past temperature changes – weather records – and combine them with the first guess from the climate model using Bayes’ theorem. The way this works is that we test which versions of the model from the first guess (prior probability) of the past are most like the observations: which are the most *useful*. We then apply those “lessons” by giving these the most prominence, the greatest weight, in our new prediction (posterior probability) of the future. This doesn’t guarantee our prediction will be correct, but it does mean it will be better because it uses evidence we have about the past.
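The weighting step can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method used by any particular climate projection: the four model versions, their warming values, the observed warming, its uncertainty, and the Gaussian form of the likelihood are all invented for the example.

```python
import math

# Hypothetical ensemble: each model version has a "prediction of the past"
# and a twin prediction of the future (warming in degrees Celsius).
ensemble = [
    {"past": 0.4, "future": 2.0},
    {"past": 0.8, "future": 3.0},
    {"past": 0.9, "future": 3.2},
    {"past": 1.5, "future": 4.5},
]
observed_past = 0.8  # assumed observed past warming
obs_sigma = 0.2      # assumed observational uncertainty (one standard deviation)


def likelihood(simulated, observed, sigma):
    """Gaussian likelihood (up to a constant) of the observation given a hindcast."""
    return math.exp(-0.5 * ((simulated - observed) / sigma) ** 2)


# Prior: every model version starts with equal weight.
prior = [1.0 / len(ensemble)] * len(ensemble)

# Bayes' theorem: posterior weight is proportional to prior times likelihood.
unnormalised = [
    p * likelihood(member["past"], observed_past, obs_sigma)
    for p, member in zip(prior, ensemble)
]
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]

# The weights learned from the past are applied to the predictions of the future.
prior_mean = sum(p * m["future"] for p, m in zip(prior, ensemble))
posterior_mean = sum(w * m["future"] for w, m in zip(posterior, ensemble))

for member, weight in zip(ensemble, posterior):
    print(f"past={member['past']:.1f}  future={member['future']:.1f}  weight={weight:.3f}")
print(f"prior mean future warming:     {prior_mean:.2f}")
print(f"posterior mean future warming: {posterior_mean:.2f}")
```

Versions whose hindcast is far from the observations end up with very little weight, so their future predictions contribute little to the posterior.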

Here’s a graph of two predictions of the probability of a future temperature change (for our purposes it doesn’t matter what) from the UK Climate Projections:

The red curve (prior) is the first guess, made by trying different parameter values in a climate model. The predicted most probable value is a warming of about three degrees Celsius. After including evidence from observations with Bayes’ theorem, the prediction is updated to give the dark blue curve (posterior). In this example the most probable temperature change is the same, but the narrower shape reflects a higher predicted probability for that value.

Probability in this Bayesian approach means “belief” about the most probable thing to happen. That sounds strange, because we think of science as objective. One way to think about it is the probability of something happening in the future versus the probability of something that happened in the past. In the coin flipping test, three heads came up out of four. That’s the past probability, the frequency of how often it happened. What about the next coin toss? Based on the available evidence – if you don’t think the coin is biased, and you don’t think I’m trying to bias the toss – you might predict that the probability of another head is 50%. That’s your belief about what is most probable, given the available evidence.
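To make the difference between past frequency and belief concrete, here is a small sketch using a standard textbook device, a Beta prior on the probability of heads. The post itself does not use this; the prior strength of 50 “imaginary” heads and 50 “imaginary” tails is an assumption standing in for a strong belief that the coin is fair.

```python
# Beta(a, b) prior on the probability of heads; a == b encodes belief in a fair coin.
# The prior strength (50 and 50) is an assumption chosen for illustration.
prior_heads, prior_tails = 50.0, 50.0

# The evidence: three heads out of four tosses.
heads, tails = 3, 1

# Past frequency: how often heads actually came up.
frequency = heads / (heads + tails)

# Posterior belief: the prior counts and observed counts simply add together.
post_heads = prior_heads + heads
post_tails = prior_tails + tails
belief_next_head = post_heads / (post_heads + post_tails)

print(f"past frequency of heads:    {frequency:.2f}")
print(f"belief that next is a head: {belief_next_head:.2f}")
```

With a strong prior, four tosses barely shift the belief away from 50%. With a weaker prior, the same evidence would pull it closer to the observed frequency.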

My use of the word belief might trigger accusations that climate predictions are a matter of faith. But Bayes’ theorem and the interpretation of “probability” as “belief” are not only used in many other areas of science, they are thought by some to describe the entire scientific method. Scientists make a first guess about an uncertain world, collect evidence, and combine these together to update their understanding and predictions. There’s even evidence to suggest that human brains are Bayesian: that we use Bayesian learning when we process information and respond to it.

The next post will be the last in the introductory series on big questions in climate modelling: *how can we predict our future?*

First my compliments on a rational, clear and unbiased discussion about the science of modeling. I am a physician and not a strong statistician but have done enough of a variety of research activities to be a danger to myself and others. At risk of raising obvious and oversimplified issues, it seems that a “good” climate model is selected by its ability to “predict” the past behavior of the system in question.

This seems analogous to clinical research that is either retrospective, or else in some fundamental way fails to meet the highest standard of randomized, double blinded, placebo controlled, prospective study. The latter standard is of course not at all sufficient to ensure truth emerges (many later revisions/reversals of just such high standard, “ground-breaking” clinical discoveries are found throughout the literature) but rather just an attempt to lessen the very high risk that a finding is the result of bias or manipulation rather than a true natural relationship.

With the climate models it appears there are a myriad of parameters and assumptions one can build in, and multiple “dials” one may fiddle to test the sensitivity to one condition or another, and I believe there is ample evidence that one can generally develop models of many different forms that will mimic an already known reality. Thus if one feels that anthropogenic CO2 is a prime driver of recent warming one would naturally test models in which this is the main parameter of interest and eventually with enough twist and turns some of those models will meet the criteria of an accurate hindcast.

But that success doesn’t mean that another individual who feels strongly that natural ocean cycles, solar cycles, aerosols, or some other natural phenomenon is a primary driver, could not create equally successful models based on different assumptions. Neither would prove the true main drivers of climate change, but only create testable hypotheses which, within the models, can undergo mathematical sensitivity analysis to various parameter changes.

Am I accurately describing this or missing some important logic that gives the models more validity? If I am not too far off base then I come to the conclusion that an accurate hindcast is a requirement to consider a climate model worthy of further investigation, but it is not a validation of that model’s ability to predict future climate. Rather the hypothesis of the model can only be validated by success in future predictions once they can be compared to measured reality. I am very happy to be corrected and educated on this if, as is quite possible, I am lacking in comprehension.

Interestingly, and as someone who is significantly involved in (not climate) modelling, my main exposure to Bayesian techniques is in measurement, not modelling! The reason is that what you actually observe (charge read out from some electronics, RNA levels, layering in ice cores…) can be quite different to what you would *like* to observe (high energy particles, protein expression, historical temperature…). So what you have to do is say “given I observe X, what is the probability that my quantity of interest had value Y?” Naturally, that mapping from X->Y very often has a model dependence itself, so you want to a) design your measurement to make the model dependence as small as possible and b) nail down the model features that remain important to increase the confidence in your measurement.

So the important thing from this point of view would be things like models of tree-ring width vs. temperature and precipitation. Less sexy than modelling the future climate perhaps, but also very important to the whole endeavour, I would guess!

Arrgh, don’t mention reconstructing temperature from tree rings or we’ll be stuck here in the comments forever! 😉 A whole post, or series of posts, to do those justice I think.

Yes, Bayesian techniques are everywhere. In fact I’m part of a group called SUPRAnet that is encouraging the use of Bayesian modelling for palaeoclimate reconstructions (oh no, I mentioned them again…).

Well, I look forward to that series of posts then. It really gets to the core (excuse the pun) of what it is to know something.

Models are tuned by hindcasting for the “best estimate” predictions used in the groups of simulations from different climate models. This is the “multi-model ensemble”.

But they are *detuned* to get probability distributions for estimating uncertainty in predictions from a given climate model – like the UK Climate Projections linked in the post, or ClimatePrediction.Net. These are “perturbed parameter ensembles”.

Both are complementary approaches and incorporate different aspects of uncertainty. The ideal would be to have a perturbed parameter ensemble for each of the multiple models, but we haven’t got there yet. So people compare and combine results from both types of ensemble.

Gratuitous plug #2: you can now buy tickets to see Jonathan and me discuss the usefulness of climate models in Cheltenham (UK) on the 7th June: Can we trust climate models?

So, given that the models are tuned by hindcasting, this means that success in hindcasting cannot be used as evidence of model skill?

It’s true that success in hindcasting is necessary but not sufficient for assessing the skill of a model. There’s no guarantee that good parameter values for the present day are good for the future. We partially address this by checking success of different parameter values for several different climate states in the past (e.g. my previous research, as yet unpublished except these conference proceedings and various conference abstracts).

For assessing model uncertainty with perturbed parameter ensembles, the original model tuning doesn’t enter into it except through (a) being the central-ish values around which different values are tried, and (b) any influence on model development, i.e. if it affects choices about the structure of the model while they are writing it.

To summarise the summary of the summary, models are tuned by hindcasting?

The models tend to be tuned to reproduce the climatological state (and variability somewhat), rather than the time-evolving 20th century. I think this point is important. So, the ‘hindcast’ of the 20th century is a different simulation which is not used for tuning. Anyway, that’s my understanding.

Ed.

For our readers: the climatological state is essentially the mean climate over a particular period, such as 1980-1999 or 1961-1990. It could be an average over every day of the year for these years, or an average for each season separately, or each month, and so on. You could also include the variation around that mean as part of the “climatology” information.

Yes, sorry Ed, I’m sure you’re right in most cases, though UK Climate Projections used both for their probability update:

Their recent climate observations are multi-year means (can’t find the date range right now) of each season for:

Their four time-varying observations are:

(sorry for jargon)

Source: UK Climate Projections 2009, Chapter 2

Hi Tamsin,

I think there’s potential in this comment string for causing some misunderstanding about “tuning”, in that different phenomena are being referenced by that same word.

Looking at the UK Climate Projections text they appear to have conducted statistical *posterior* “tuning” (I think typically it’s termed “scaling” or “weighting”, which are perhaps less ambiguous words in this context) by taking already-existing model runs and comparing to observations. This type of study usually does involve scaling/weighting/tuning by comparing changes over time – observational fit to a hindcast.

The type of tuning to which Ed and Jonathan refer is model development tuning performed *prior* to outputting the model hindcast/historical runs, say for CMIP5. Ed points out that this type of tuning is conducted by reference to climatological states rather than trends.

Hi Jonathan,

Ed Hawkins is more of an expert than I am. And perhaps one has to be a bit careful about how the models are actually tuned, and how they are officially tuned. But you can get a clear statement of the latter for the CCSM4 model, see DOI:10.1175/2011JCLI4083.1

In this paper (sec 3) the authors state that each of the modules of the climate model (ocean, atmosphere, land, sea ice) is individually tuned using observations to provide boundary conditions. Then the four modules are coupled, and the model as a whole is tuned to Top Of the Atmosphere (TOA) radiation using a cloud parameter, and Arctic sea ice using sea ice albedo parameters. That’s a very limited amount of whole-model tuning, in keeping with the modellers’ stated desire to minimise compensatory mis-tuning.

A climate model feature such as C20th global mean temperature is an emergent feature of the interaction of the four modules, and so the CCSM model’s performance on this output is a diagnostic — it has not been explicitly tuned to. The same goes for other features, such as getting the frequency of the El Nino Southern Oscillation about right; and getting the dual-lobed structure of the ITCZ wrong.

Of course with our suspicious hats on we cannot rule out the possibility that the process has been looped once or twice after inspecting C20th global mean temperature, but the expense of model runs prevents much looping. These very large models run at about 100 model years per calendar month, so an 1850-today run will take about 2 calendar months. This is not the kind of target that can be built into a tuning exercise, especially with an IPCC deadline looming.

Best wishes, Jonty.