From: Ben Santer <santer1@llnl.gov>
To: John Lanzante <John.Lanzante@noaa.gov>,  Thomas R Karl <Thomas.R.Karl@noaa.gov>, carl mears <mears@remss.com>, "David C. Bader" <bader2@llnl.gov>,  "'Dian J. Seidel'" <dian.seidel@noaa.gov>, "'Francis W. Zwiers'" <francis.zwiers@ec.gc.ca>,  Frank Wentz <frank.wentz@remss.com>, Karl Taylor <taylor13@llnl.gov>,  Leopold Haimberger <leopold.haimberger@univie.ac.at>, Melissa Free <Melissa.Free@noaa.gov>,  "Michael C. MacCracken" <mmaccrac@comcast.net>, "'Philip D. Jones'" <p.jones@uea.ac.uk>,  Steven Sherwood <Steven.Sherwood@yale.edu>, Steve Klein <klein21@mail.llnl.gov>,  'Susan Solomon' <ssolomon@al.noaa.gov>, "Thorne, Peter" <peter.thorne@metoffice.gov.uk>,  Tim Osborn <t.osborn@uea.ac.uk>, Tom Wigley <wigley@cgd.ucar.edu>, Gavin Schmidt <gschmidt@giss.nasa.gov>
Subject: More significance testing
Date: Thu, 27 Dec 2007 16:26:19 -0800
Reply-to:  santer1@llnl.gov

Dear folks,

This email briefly summarizes the trend significance test results. As I 
mentioned in yesterday's email, I've added a new case (referred to as 
"TYPE3" below). I've also added results for tests with a stipulated 10% 
significance level. Here is the explanation of the four different types 
of trend test:

1. "OBS-vs-MODEL": Observed MSU trends in RSS and UAH are tested against 
trends in synthetic MSU data in 49 realizations of the 20c3m experiment. 
Results from RSS and UAH are pooled, yielding a total of 98 tests for T2 
trends and 98 tests for T2LT trends.

2. "MODEL-vs-MODEL (TYPE1)": Involves model data only. Trend in 
synthetic MSU data in each of 49 20c3m realizations is tested against 
each trend in the remaining 48 realizations (i.e., no trend tests 
involving identical data). Yields a total of 49 x 48 = 2352 tests. The 
significance of trend differences is a function of BOTH inter-model 
differences (in climate sensitivity, applied 20c3m forcings, and the 
amplitude of variability) AND "within-model" effects (i.e., is related 
to the different manifestations of natural internal variability 
superimposed on the underlying forced response).

3. "MODEL-vs-MODEL (TYPE2)": Involves model data only. Limited to the M 
models with multiple realizations of the 20c3m experiment. For each of 
these M models, the number of unique combinations C of N 20c3m 
realizations into R trend pairs is determined. For example, in the case 
of N = 5, C = N! / [ R!(N-R)! ] = 10. The significance of trend 
differences is solely a function of "within-model" effects (i.e., is 
related to the different manifestations of natural internal variability 
superimposed on the underlying forced response). There are a total of 62 
tests (not 124, as I erroneously reported yesterday!)

4. "MODEL-vs-MODEL (TYPE3)": Involves model data only. For each of the 
19 models, only the first 20c3m realization is used. The trend in each 
model's first 20c3m realization is tested against each trend in the 
first 20c3m realization of the remaining 18 models. Yields a total of 19 
x 18 = 342 tests. The significance of trend differences is solely a 
function of inter-model differences (in climate sensitivity, applied 
20c3m forcings, and the amplitude of variability).
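
Here is that counting sketch: a minimal Python illustration of the 
bookkeeping for the three MODEL-vs-MODEL cases. The ensemble sizes in 
"sizes" are hypothetical placeholders, chosen only to be consistent 
with the totals quoted above; the actual per-model 20c3m ensemble 
sizes are not listed in this email.

  from math import comb

  n_models = 19   # CMIP-3 models used in the CCSP Report
  n_runs = 49     # 20c3m realizations pooled across all models

  # TYPE1: each realization tested against every other realization
  # (ordered pairs, no self-comparisons):
  type1 = n_runs * (n_runs - 1)              # 49 * 48 = 2352

  # TYPE2: within each multi-run model, all unique pairs of its N
  # realizations: C = N! / [2!(N - 2)!] = comb(N, 2).
  # NOTE: these ensemble sizes are hypothetical, picked only to match
  # the stated totals (10 multi-run models plus 9 single-run models
  # = 49 runs, and 62 unique pairs).
  sizes = [5, 5, 4, 4, 4, 4, 4, 4, 3, 3]
  type2 = sum(comb(n, 2) for n in sizes)     # = 62

  # TYPE3: first realization of each model against the first
  # realization of each of the other models (ordered pairs):
  type3 = n_models * (n_models - 1)          # 19 * 18 = 342

  print(type1, type2, type3)                 # 2352 62 342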

REJECTION RATES FOR STIPULATED 5% SIGNIFICANCE LEVEL
Test type                  No. of tests       T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL            49 x 2    (98)     2  (2.04%)     1  (1.02%)
2. MODEL-vs-MODEL (TYPE1)  49 x 48 (2352)    58  (2.47%)    32  (1.36%)
3. MODEL-vs-MODEL (TYPE2)    ---     (62)     0  (0.00%)     0  (0.00%)
4. MODEL-vs-MODEL (TYPE3)  19 x 18  (342)    22  (6.43%)    14  (4.09%)

REJECTION RATES FOR STIPULATED 10% SIGNIFICANCE LEVEL
Test type                  No. of tests       T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL            49 x 2    (98)     4  (4.08%)     2  (2.04%)
2. MODEL-vs-MODEL (TYPE1)  49 x 48 (2352)    80  (3.40%)    46  (1.96%)
3. MODEL-vs-MODEL (TYPE2)    ---     (62)     1  (1.61%)     0  (0.00%)
4. MODEL-vs-MODEL (TYPE3)  19 x 18  (342)    28  (8.19%)    20  (5.85%)

REJECTION RATES FOR STIPULATED 20% SIGNIFICANCE LEVEL
Test type                  No. of tests       T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL            49 x 2    (98)     7  (7.14%)     5  (5.10%)
2. MODEL-vs-MODEL (TYPE1)  49 x 48 (2352)   176  (7.48%)   100  (4.25%)
3. MODEL-vs-MODEL (TYPE2)    ---     (62)     4  (6.45%)     3  (4.84%)
4. MODEL-vs-MODEL (TYPE3)  19 x 18  (342)    42 (12.28%)    28  (8.19%)

Features of interest:

A) As you might expect, for each of the three significance levels, TYPE3 
tests yield the highest rejection rates of the null hypothesis of "no 
difference in trend". TYPE2 tests yield the lowest rejection rates. 
This is simply telling us that the inter-model differences in trends 
tend to be larger than the "between-realization" differences in trends 
in any individual model.

B) Rejection rates for the model-versus-observed trend tests are 
consistently LOWER than for the model-versus-model (TYPE3) tests. On 
average, therefore, the tropospheric trend differences between the 
observational datasets used here (RSS and UAH) and the synthetic MSU 
temperatures calculated from 19 CMIP-3 models are actually LESS 
SIGNIFICANT than the inter-model trend differences arising from 
differences in sensitivity, 20c3m forcings, and levels of variability.

I also thought that it would be fun to use the model data to explore the 
implications of Douglass et al.'s flawed statistical procedure. Recall 
that Douglass et al. compare (in their Table III) the observed T2 and 
T2LT trends in RSS and UAH with the overall means of the multi-model 
distributions of T2 and T2LT trends. Their standard error, sigma{SE}, is 
meant to represent an "estimate of the uncertainty of the mean" (i.e., 
the mean trend). sigma{SE} is given as:

sigma{SE} = sigma / sqrt{N - 1}

where sigma is the standard deviation of the model trends, and N is "the 
number of independent models" (22 in their case). Douglass et al. 
apparently estimate sigma using ensemble-mean trends for each model (if 
20c3m ensembles are available).
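
In code, their standard error is a one-liner. The sigma value below is 
purely illustrative (roughly the size of the T2LT inter-model spread in 
the table further down); the comment flags the key point, namely that 
the inter-model spread gets shrunk by a factor of sqrt(N - 1):

  import math

  N = 22         # Douglass et al.'s "number of independent models"
  sigma = 0.09   # illustrative std. dev. of model trends (K/decade)

  # Their "uncertainty of the mean": shrinks the inter-model spread
  # by a factor of sqrt(N - 1) ~ 4.6 for N = 22.
  sigma_se = sigma / math.sqrt(N - 1)
  print(f"{sigma_se:.4f}")   # ~0.0196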

So what happens if we apply this procedure using model data only? This 
is rather easy to do. As above (in the TYPE1, TYPE2, and TYPE3 tests), I 
simply used the synthetic MSU trends from the 19 CMIP-3 models employed 
in our CCSP Report and in Santer et al. 2005 (so N = 19). For each 
model, I calculated the ensemble-mean 20c3m trend over 1979 to 1999 
(where multiple 20c3m realizations were available). Let's call these 
mean trends b{j}, where j (the index over models) = 1, 2, ..., 19. 
Further, let's regard b{1} as the surrogate observations, and then use 
Douglass et al.'s approach to test whether b{1} is significantly 
different from the overall mean of the remaining 18 members of b{j}. 
Then repeat with b{2} as surrogate observations, etc. For each 
layer-averaged temperature series, this yields 19 tests of the 
significance of differences in mean trends.
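
A minimal Python sketch of this surrogate-observations procedure is 
below. The conventions (a 1/n-normalized SIGMA over the 18 remaining 
trends, and a two-sided normal p-value) are my reading of the numbers 
in the table that follows; with them, the sketch reproduces the 
CCSM3.0 row to within rounding of the tabulated inputs.

  import math
  from statistics import NormalDist

  # Ensemble-mean T2LT trends b{j} for the 19 models, copied from the
  # table below (units: K/decade).
  b = [0.1580, 0.2576, 0.3567, 0.1477, 0.1938, 0.1285, 0.2298, 0.2800,
       0.1496, 0.1936, 0.3099, 0.4236, 0.2409, 0.2780, 0.1252, 0.1834,
       0.1788, 0.0197, 0.2258]

  def douglass_test(obs, rest):
      """Douglass et al.-style test of obs vs. the mean of rest."""
      modave = sum(rest) / len(rest)
      # 1/n-normalized standard deviation of the remaining 18 trends
      sigma = math.sqrt(sum((x - modave) ** 2 for x in rest) / len(rest))
      # sigma{SE} = sigma / sqrt(N - 1), with N = 19 models
      sigma_se = sigma / math.sqrt(len(rest))
      normd = abs(obs - modave) / sigma_se
      p = 2.0 * (1.0 - NormalDist().cdf(normd))  # two-sided p-value
      return modave, sigma, sigma_se, normd, p

  # Treat each model in turn as the surrogate observations:
  for j, obs in enumerate(b):
      modave, sigma, sigma_se, normd, p = douglass_test(obs, b[:j] + b[j+1:])
      print(f"{modave:.4f} {sigma:.4f} {sigma_se:.4f} {normd:8.4f} {p:.4f}")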

To give you a feel for this stuff, I've reproduced below the results for 
tests involving T2LT trends. The "OBS" column is the ensemble-mean T2LT 
trend in the surrogate observations. "MODAVE" is the overall mean trend 
in the 18 remaining members of the distribution, and "SIGMA" is the 
1-sigma standard deviation of these trends. "SIGMA{SE}" is 1 x 
sigma{SE} = SIGMA / sqrt{N - 1} (note that Douglass et al. give 2 x 
sigma{SE} in their Table III; multiplying our SIGMA{SE} results by two 
gives values similar to theirs). "NORMD" is the normalized absolute 
difference |OBS - MODAVE| / SIGMA{SE}, and "P-VALUE" is the two-sided 
p-value for the normalized difference, assuming that this difference is 
approximately normally distributed.

MODEL          "OBS"     MODAVE    SIGMA   SIGMA{SE}   NORMD     P-VALUE 

CCSM3.0        0.1580    0.2179    0.0910    0.0215    2.7918    0.0052 

GFDL2.0        0.2576    0.2124    0.0915    0.0216    2.0977    0.0359 

GFDL2.1        0.3567    0.2069    0.0854    0.0201    7.4404    0.0000 

GISS_EH        0.1477    0.2185    0.0906    0.0214    3.3153    0.0009 

GISS_ER        0.1938    0.2159    0.0919    0.0217    1.0205    0.3075
MIROC3.2_T42   0.1285    0.2196    0.0897    0.0211    4.3094    0.0000
MIROC3.2_T106  0.2298    0.2139    0.0920    0.0217    0.7305    0.4651
MRI2.3.2a      0.2800    0.2111    0.0907    0.0214    3.2196    0.0013 

PCM            0.1496    0.2184    0.0907    0.0214    3.2170    0.0013 

HADCM3         0.1936    0.2159    0.0919    0.0217    1.0327    0.3018 

HADGEM1        0.3099    0.2095    0.0891    0.0210    4.7784    0.0000 

CCCMA3.1       0.4236    0.2032    0.0769    0.0181   12.1591    0.0000 

CNRM3.0        0.2409    0.2133    0.0918    0.0216    1.2762    0.2019 

CSIRO3.0       0.2780    0.2113    0.0908    0.0214    3.1195    0.0018
ECHAM5         0.1252    0.2197    0.0895    0.0211    4.4815    0.0000
IAP_FGOALS1.0  0.1834    0.2165    0.0917    0.0216    1.5314    0.1257
GISS_AOM       0.1788    0.2168    0.0916    0.0216    1.7579    0.0788
INMCM3.0       0.0197    0.2256    0.0790    0.0186   11.0541    0.0000
IPSL_CM4       0.2258    0.2142    0.0920    0.0217    0.5359    0.5920

T2LT: No. of p-values .le. 0.05: 12.  Rejection rate:  63.16%
T2LT: No. of p-values .le. 0.10: 13.  Rejection rate:  68.42%
T2LT: No. of p-values .le. 0.20: 14.  Rejection rate:  73.68%

The corresponding rejection rates for the tests involving T2 data are:

T2:   No. of p-values .le. 0.05: 12.  Rejection rate:  63.16%
T2:   No. of p-values .le. 0.10: 13.  Rejection rate:  68.42%
T2:   No. of p-values .le. 0.20: 15.  Rejection rate:  78.95%
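
The threshold counting behind these rejection rates is trivial; a short 
sketch using the T2LT p-values from the table above:

  # The 19 two-sided p-values from the T2LT table, in table order.
  p_values = [0.0052, 0.0359, 0.0000, 0.0009, 0.3075, 0.0000, 0.4651,
              0.0013, 0.0013, 0.3018, 0.0000, 0.0000, 0.2019, 0.0018,
              0.0000, 0.1257, 0.0788, 0.0000, 0.5920]

  # Count rejections at each stipulated significance level:
  for alpha in (0.05, 0.10, 0.20):
      hits = sum(p <= alpha for p in p_values)
      print(f"p <= {alpha:.2f}: {hits} of {len(p_values)} "
            f"({100.0 * hits / len(p_values):.2f}%)")
  # -> 12 (63.16%), 13 (68.42%), 14 (73.68%), matching the T2LT rates.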

Bottom line: If we apply Douglass et al.'s ridiculous test of 
difference in mean trends to model data only - in fact, to virtually the 
same model data they used in their paper - we would conclude that 
nearly two-thirds of the individual models have trends that are 
significantly different from the multi-model mean trend! By Douglass 
et al.'s flawed logic, this would mean that two-thirds of the models 
really aren't models after all...

Happy New Year to all of you!

With best regards,

Ben
----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel:   (925) 422-2486
FAX:   (925) 422-7675
email: santer1@llnl.gov
---------------------------------------------------------------------------- 