cc: "Thomas.R.Karl" <Thomas.R.Karl@noaa.gov>,  carl mears <mears@remss.com>, "David C. Bader" <bader2@llnl.gov>,  "'Dian J. Seidel'" <dian.seidel@noaa.gov>, "'Francis W. Zwiers'" <francis.zwiers@ec.gc.ca>,  Frank Wentz <frank.wentz@remss.com>, Karl Taylor <taylor13@llnl.gov>,  Leopold Haimberger <leopold.haimberger@univie.ac.at>, Melissa Free <Melissa.Free@noaa.gov>,  "Michael C. MacCracken" <mmaccrac@comcast.net>, "'Philip D. Jones'" <p.jones@uea.ac.uk>,  Sherwood Steven <steven.sherwood@yale.edu>, Steve Klein <klein21@llnl.gov>, 'Susan Solomon' <susan.solomon@noaa.gov>,  "Thorne, Peter" <peter.thorne@metoffice.gov.uk>, Tim Osborn <t.osborn@uea.ac.uk>, Tom Wigley <wigley@cgd.ucar.edu>,  Gavin Schmidt <gschmidt@giss.nasa.gov>
date: Wed, 26 Dec 2007 18:50:55 -0800
from: Ben Santer <santer1@llnl.gov>
subject: More significance testing stuff
to:  John.Lanzante@noaa.gov

<x-flowed>
Dear John,

Thanks for your email. As usual, your comments were constructive and 
thought-provoking. I've tried to do some of the additional tests that 
you suggested, and will report on the results below.

But first, let's have a brief recap. As discussed in my previous emails, 
I've tested the significance of differences between trends in observed 
MSU time series and the trends in synthetic MSU temperatures in a 
multi-model "ensemble of opportunity". The "ensemble of opportunity" 
comprises results from 49 realizations of the CMIP-3 "20c3m" experiment, 
performed with 19 different A/OGCMs. This is the same ensemble that was 
analyzed in Chapter 5 of the CCSP Synthesis and Assessment Product 1.1.
I've used observational results from two different groups (RSS and UAH). 
From each group, we have results for both T2 and T2LT. This yields a 
total of 196 different tests of the significance of 
observed-versus-model trend differences (2 observational datasets x 2 
layer-averaged temperatures x 49 realizations of the 20c3m experiment). 
Thus far, I've tested the significance of trend differences using T2 and 
T2LT data spatially averaged over oceans only (both 20N-20S and 
30N-30S), as well as over land and ocean (20N-20S). All results 
described below focus on the land and ocean results, which facilitates a 
direct comparison with Douglass et al.

Here was the information that I sent you on Dec. 14th:

COMBINED LAND/OCEAN RESULTS (WITH STANDARD ERRORS ADJUSTED FOR TEMPORAL 
AUTOCORRELATION EFFECTS; SPATIAL AVERAGES OVER 20N-20S; ANALYSIS PERIOD 
1979 TO 1999)

T2LT tests, RSS observational data: 0 out of 49 model-versus-observed 
trend differences are significant at the 5% level.
T2LT tests, UAH observational data: 1 out of 49 model-versus-observed 
trend differences are significant at the 5% level.

T2 tests, RSS observational data: 1 out of 49 model-versus-observed 
trend differences are significant at the 5% level.
T2 tests, UAH observational data: 1 out of 49 model-versus-observed 
trend differences are significant at the 5% level.

In other words, at a stipulated significance level of 5% (for a 
two-tailed test), we rejected the null hypothesis of "No significant 
difference between observed and simulated tropospheric temperature 
trends" in only 1 out of 98 cases (1.02%) for T2LT and 2 out of 98 cases 
(2.04%) for T2.
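
For reference, the paired trend test might be sketched as follows (a 
minimal Python sketch only; I've assumed the standard lag-1 
effective-sample-size adjustment for the autocorrelation correction, and 
a normal reference distribution, since those details aren't spelled out 
above):

```python
import numpy as np
from scipy import stats

def adjusted_trend(y):
    """Least-squares trend and its standard error, with the standard
    error inflated for lag-1 autocorrelation of the regression residuals
    via the effective sample size n_eff = n * (1 - r1) / (1 + r1)."""
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    n_eff = n * (1.0 - r1) / (1.0 + r1)
    s2 = np.sum(resid ** 2) / (n_eff - 2.0)      # residual variance, adjusted
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return slope, se

def paired_trend_test(y_obs, y_mod, alpha=0.05):
    """Two-tailed test of the null hypothesis of no difference between
    the observed and simulated trends."""
    b_o, se_o = adjusted_trend(y_obs)
    b_m, se_m = adjusted_trend(y_mod)
    d = (b_o - b_m) / np.hypot(se_o, se_m)       # normalized trend difference
    p = 2.0 * stats.norm.sf(abs(d))
    return d, p, p < alpha
```

Each of the 196 tests described above corresponds to one such call, 
pairing an observed (RSS or UAH) series with one 20c3m realization.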

You asked, John, how we might determine a baseline for judging the 
likelihood of obtaining the 'observed' rejection rate by chance alone. 
You suggested use of a bootstrap procedure involving the model data 
only. In this procedure, one of the 49 20c3m realizations would be 
selected at random, and would constitute the "surrogate observations". 
The remaining 48 members would be randomly sampled (with replacement) 49 
times. The significance of the difference between the surrogate 
"observed" trend and the 49 simulated trends would then be assessed. 
This procedure would be repeated many times, yielding a distribution of 
rejection rates of the null hypothesis.

As you stated in your email, "The actual number of hits, based on the 
real observations could then be referenced to the Monte Carlo 
distribution to yield a probability that this could have occurred by 
chance."
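
For the record, the resampling scheme you describe might look like this 
in code (a sketch only; `test_pair` stands in for whatever paired trend 
significance test is applied, and the names are mine):

```python
import numpy as np

def bootstrap_rejection_rates(trends_model, test_pair, n_trials=1000, seed=0):
    """Monte Carlo distribution of rejection counts ("hits"), treating a
    randomly chosen ensemble member as the surrogate observations.
    `trends_model` is the list of 49 20c3m time series; `test_pair(a, b)`
    returns True when the trend difference is judged significant."""
    rng = np.random.default_rng(seed)
    k = len(trends_model)
    hits = np.empty(n_trials, dtype=int)
    for i in range(n_trials):
        obs_idx = rng.integers(k)                        # surrogate "observation"
        rest = [j for j in range(k) if j != obs_idx]
        sample = rng.choice(rest, size=k, replace=True)  # bootstrap sample of 49
        hits[i] = sum(test_pair(trends_model[obs_idx], trends_model[j])
                      for j in sample)
    return hits
```

The actual number of hits from the real observations would then be 
referenced against the returned distribution.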

One slight problem with your suggested bootstrap approach is that it 
convolves the trend differences due to internally-generated variability 
with trend differences arising from inter-model differences in both 
climate sensitivity and in the forcings applied in the 20c3m experiment. 
So the distribution of "hits" (as you call it; or "rejection rates" in 
my terminology) is not the distribution that one might expect due to 
chance alone.

Nevertheless, I thought it would be interesting to generate a 
distribution of "rejection rates" based on model data only. Rather than 
implementing the resampling approach that you suggested, I considered 
all possible combinations of trend pairs involving model data, and 
performed the paired difference test between the trend in each 20c3m 
realization and in each of the other 48 realizations. This yields a 
total of 2352 (49 x 48) non-identical pairs of trend tests (for each 
layer-averaged temperature time series).
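
In code, this exhaustive version simply loops over ordered, 
non-identical pairs (again a sketch; `test_pair` is whatever paired 
trend test is in use):

```python
from itertools import permutations

def all_pairs_rejection_rate(series, test_pair):
    """Rejection rate over every ordered, non-identical pair of ensemble
    members (49 x 48 = 2352 pairs for the full 49-member ensemble)."""
    pairs = list(permutations(range(len(series)), 2))
    hits = sum(test_pair(series[i], series[j]) for i, j in pairs)
    return hits, hits / len(pairs)
```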

Here are the results:

T2: At a stipulated 5% significance level, 58 out of 2352 tests 
involving model data only (2.47%) yielded rejection of the null 
hypothesis of no significant difference in trend.

T2LT: At a stipulated 5% significance level, 32 out of 2352 tests 
involving model data only (1.36%) yielded rejection of the null 
hypothesis of no significant difference in trend.

For both layer-averaged temperatures, these numbers are slightly larger 
than the "observed" rejection rates (2.04% for T2 and 1.02% for T2LT). I 
would conclude from this that the statistical significance of the 
differences between the observed and simulated MSU tropospheric 
temperature trends is comparable to the significance of the differences 
between the simulated 20c3m trends from any two CMIP-3 models (with the 
proviso that the simulated trend differences arise not only from 
internal variability, but also from inter-model differences in 
sensitivity and 20th century forcings).

Since I was curious, I thought it would be fun to do something a little 
closer to what you were advocating, John - i.e., to use model data to 
look at the statistical significance of trend differences that are NOT 
related to inter-model differences in the 20c3m forcings or in climate 
sensitivity. I did this in the following way. For each model with 
multiple 20c3m realizations, I tested each realization against all other 
(non-identical) realizations of that model - e.g., for a model with a 
20c3m ensemble size of 5, there are 20 paired trend tests involving 
non-identical data. I repeated this procedure for the next model with 
multiple 20c3m realizations, etc., and accumulated results. In our CCSP 
report, we had access to 11 models with multiple 20c3m realizations. 
This yields a total of 124 paired trend tests for each layer-averaged 
temperature time series of interest.

For both T2 and T2LT, NONE of the 124 paired trend tests yielded 
rejection of the null hypothesis of no significant difference in trend 
(at a stipulated 5% significance level).
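
The within-model bookkeeping might be sketched as follows (names are 
mine; `ensembles` would map each of the 11 models to its list of 20c3m 
realizations, so a model with n members contributes n x (n - 1) ordered 
pairs):

```python
def within_model_pairs(ensembles, test_pair):
    """Paired trend tests restricted to realizations of the same model,
    so that inter-model differences in sensitivity and applied forcings
    drop out and only internal variability distinguishes the members."""
    n_tests = hits = 0
    for members in ensembles.values():
        for i, a in enumerate(members):
            for j, b in enumerate(members):
                if i != j:
                    n_tests += 1
                    hits += bool(test_pair(a, b))
    return hits, n_tests
```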

You wanted to know, John, whether these rejection rates are sensitive to 
the stipulated significance level. As per your suggestion, I also 
calculated rejection rates for a 20% significance level. Below, I've 
tabulated a comparison of the rejection rates for tests with 5% and 20% 
significance levels. The two "rows" of "MODEL-vs-MODEL" results 
correspond to the two cases I've considered above - i.e., tests 
involving 2352 trend pairs (Row 2) and 124 trend pairs (Row 3). Note 
that the "OBSERVED-vs-MODEL" row (Row 1) is the combined number of 
"hits" for 49 tests involving RSS data and 49 tests involving UAH data:

REJECTION RATES FOR STIPULATED 5% SIGNIFICANCE LEVEL:
    Test type              No. of tests     T2 "Hits"       T2LT "Hits"

Row 1. OBSERVED-vs-MODEL     49 x 2         2  (2.04%)     1  (1.02%)
Row 2. MODEL-vs-MODEL        2352          58  (2.47%)    32  (1.36%)
Row 3. MODEL-vs-MODEL         124           0  (0.00%)     0  (0.00%)

REJECTION RATES FOR STIPULATED 20% SIGNIFICANCE LEVEL:
    Test type              No. of tests     T2 "Hits"       T2LT "Hits"

Row 1. OBSERVED-vs-MODEL     49 x 2         7  (7.14%)     5  (5.10%)
Row 2. MODEL-vs-MODEL        2352         176  (7.48%)   100  (4.25%)
Row 3. MODEL-vs-MODEL         124           8  (6.45%)     6  (4.84%)

So what can we conclude from this?

1) Irrespective of the stipulated significance level (5% or 20%), the 
differences between the observed and simulated MSU trends are, on 
average, substantially smaller than we might expect if we were 
conducting these tests with trends selected from a purely random 
distribution (i.e., for the "Row 1" results, 2.04% and 1.02% << 5%, and 
7.14% and 5.10% << 20%).

2) Why are the rejection rates for the "Row 3" results substantially 
lower than 5% and 20%? Shouldn't we expect - if we are only testing 
trend differences between multiple realizations of the same model, 
rather than trend differences between models - to obtain rejection rates 
of roughly 5% for the 5% significance tests and 20% for the 20% tests? 
The answer is clearly "no". The "Row 3" results do not involve tests 
between samples drawn from a population of randomly-distributed trends! 
If we were conducting this paired test using randomly-sampled trends 
from a long control simulation, we would expect (given a sufficiently 
large sample size) to eventually obtain rejection rates of 5% and 20%. 
But our "Row 3" results are based on paired samples from individual 
members of a given model's 20c3m experiment, and thus represent both 
signal (response to the imposed forcing changes) and noise - not noise 
alone. The common signal component makes it more difficult to reject the 
null hypothesis of no significant difference in trend.

3) Your point about sensitivity to the choice of stipulated significance 
level was well-taken. This is obvious by comparing "Row 3" results in 
the 5% and 20% test cases.

4) In both the 5% and 20% cases, the rejection rate for paired tests 
involving model-versus-observed trend differences ("Row 1") is 
comparable to the rejection rate for tests involving inter-model trend 
differences ("Row 2") arising from the combined effects of differences 
in internal variability, sensitivity, and applied forcings. On average, 
therefore, model-versus-observed trend differences are not noticeably 
more significant than the trend differences between any given pair of 
CMIP-3 models. [N.B.: This inference is not entirely justified, since "Row 2" 
convolves the effects of both inter-model differences and "within model" 
differences arising from the different manifestations of natural 
variability superimposed on the signal. We would need a "Row 4", which 
involves 19 x 18 paired tests of model results, using only one 20c3m 
realization from each model. I'll generate "Row 4" tomorrow.]

John, you also suggested that we might want to look at the statistical 
significance of trends in time series of differences - e.g., in O(t) 
minus M(t), or in M1(t) minus M2(t), where "O" denotes observations, and 
"M" denotes model, and t is an index of time in months. While I've done 
this in previous work (for example in the Santer et al. 2000 JGR paper, 
where we were looking at the statistical significance of trend 
differences between multiple observational upper air temperature 
datasets), I don't think it's advisable in this particular case. As your 
email notes, we are dealing here with A/OGCM results in which the 
phasing of El Ninos and La Ninas (and the effects of ENSO variability on 
T2 and T2LT) differs from the phasing in the real world. So differencing 
M(t) from O(t), or M2(t) from M1(t), probably actually amplifies rather 
than damps noise, particularly in the tropics, where the 
externally-forced component of M(t) or O(t) over 1979 to 1999 is only a 
relatively small fraction of the overall variance of the time series. I 
think this amplification of noise is a disadvantage in assessing whether 
trends in O(t) and M(t) are significantly different.
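
The variance-addition point is easy to illustrate with synthetic data 
(everything below is illustrative, not MSU data): when the noise in O(t) 
and M(t) is uncorrelated, the common forced trend cancels in the 
difference series while the noise variances add, roughly doubling the 
noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 252                                      # 21 years of monthly data
signal = 0.001 * np.arange(n)                # weak common forced trend
o = signal + rng.normal(scale=0.3, size=n)   # "observations"
m = signal + rng.normal(scale=0.3, size=n)   # "model"
diff = o - m
# var(diff) ~ var(noise_o) + var(noise_m): the difference series is
# noisier than either input, even though the common trend cancels
print(np.var(o), np.var(m), np.var(diff))
```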

Anyway, thanks again for your comments and suggestions, John. They gave 
me a great opportunity to ignore the hundreds of emails that accumulated 
in my absence, and instead do some science!

With best regards,

Ben

John Lanzante wrote:
> Ben,
> 
> Perhaps a resampling test would be appropriate. The tests you have performed
> consist of pairing an observed time series (UAH or RSS MSU) with each one
> of 49 GCM times series from your "ensemble of opportunity". Significance
> of the difference between each pair of obs/GCM trends yields a certain
> number of "hits".
> 
> To determine a baseline for judging how likely it would be to obtain the
> given number of hits one could perform a set of resampling trials by
> treating one of the ensemble members as a surrogate observation. For each
> trial, select at random one of the 49 GCM members to be the "observation".
> From the remaining 48 members draw a bootstrap sample of 49, and perform
> 49 tests, yielding a certain number of "hits". Repeat this many times to
> generate a distribution of "hits".
> 
> The actual number of hits, based on the real observations could then be
> referenced to the Monte Carlo distribution to yield a probability that this
> could have occurred by chance. The basic idea is to see if the observed
> trend is inconsistent with the GCM ensemble of trends.
> 
> There are a couple of additional tweaks that could be applied to your method.
> You are currently computing trends for each of the two time series in the
> pair and assessing the significance of their differences. Why not first
> create a difference time series and assess the significance of its trend?
> The advantage of this is that you would reduce somewhat the autocorrelation
> in the time series and hence the effect of the "degrees of freedom"
> adjustment. Since the GCM runs are based on coupled model runs this
> differencing would help remove the common externally forced variability,
> but not internally forced variability, so the adjustment would still be
> needed.
> 
> Another tweak would be to alter the significance level used to assess
> differences in trends. Currently you are using the 5% level, which yields
> only a small number of hits. If you made this less stringent you would get
> potentially more weaker hits. But it would all come out in the wash so to
> speak since the number of hits in the Monte Carlo simulations would increase
> as well. I suspect that increasing the number of expected hits would make the
> whole procedure more powerful/efficient in a statistical sense since you
> would no longer be dealing with a "rare event". In the current scheme, using
> a 5% level with 49 pairings you have an expected hit rate of 0.05 X 49 = 2.45.
> For example, if instead you used a 20% significance level you would have an
> expected hit rate of 0.20 X 49 = 9.8.
> 
> I hope this helps.
> 
> On an unrelated matter, I'm wondering a bit about the different versions of
> Leo's new radiosonde dataset (RAOBCORE). I was surprised to see that the
> latest version has considerably more tropospheric warming than I recalled
> from an earlier version that was written up in JCLI in 2007. I have a
> couple of questions that I'd like to ask Leo. One concern is that if we use
> the latest version of RAOBCORE is there a paper that we can reference --
> if this is not in a peer-reviewed journal is there a paper in submission?
> The other question is: could you briefly comment on the differences in 
> methodology used to generate the latest version of RAOBCORE as compared to 
> the version used in JCLI 2007, and what/when/where did changes occur to
> yield a stronger warming trend?
> 
> Best regards,
> 
> ______John
> 
> 
> 
> On Saturday 15 December 2007 12:21 pm, Thomas.R.Karl wrote:
>> Thanks Ben,
>>
>> You have the makings of a nice article.
>>
>> I note that we would expect about 10 cases that are significantly different 
>> by chance (based on the 196 tests at the .05 sig level).  You found 3.  
>> With the appropriately corrected Leopold data I suspect you will find 
>> there are indeed stat. sig. similar trends, incl. amplification.  Setting up the 
>> statistical testing should be interesting with this many combinations.
>>
>> Regards, Tom
> 


-- 
----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel:   (925) 422-2486
FAX:   (925) 422-7675
email: santer1@llnl.gov
---------------------------------------------------------------------------- 
</x-flowed>
