date: Wed, 2 May 2007 15:32:52 +0100
from: Ian Harris <i.harris@uea.ac.uk>
subject: Progress
to: Phil Jones <p.jones@uea.ac.uk>

<x-flowed>
Hi Phil

Sorry to miss you this morning.

Just to let you know that I've found several potentially-major  
problems with the anomaly program anomdtb, which as far as I know was  
used to produce CRU TS 2.1.

In the 'duplication' section, stations within 8km (ref. notes and the  
Mitchell & Jones IJC paper) of each other are rolled together: the  
station with the lower WMO code donates its data to fill any missing  
values in the second station, and is then marked for no further use  
by setting its WMO code to -999 (in the internal arrays, obviously).

I was investigating the high number of duplications found, and  
discovered that, even though stations were marked for exclusion in  
this way, they continued to be evaluated as possible duplicates and  
so could contribute the same data to multiple stations. Problem One,  
worrying but not critical.
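
A minimal sketch of the guard that seems to be missing, in Python rather than the program's own language. The station records, the 'name' field and the is_duplicate callback are all made up for illustration; only the -999 marker comes from the description above.

```python
MISSING = -999  # marker for both excluded stations and missing values


def merge_duplicates(stations, is_duplicate):
    """Roll duplicate stations together, letting each donor donate only once.

    stations: list of dicts with 'name', 'wmo' and 'data', sorted by WMO code.
    is_duplicate: hypothetical callable (a, b) -> bool, e.g. a distance test.
    """
    for i, donor in enumerate(stations):
        if donor["wmo"] == MISSING:
            continue  # the guard: an excluded station is never evaluated again
        for receiver in stations[i + 1:]:
            if receiver["wmo"] == MISSING:
                continue
            if is_duplicate(donor, receiver):
                # Fill only the receiver's missing values from the donor.
                receiver["data"] = [
                    r if r != MISSING else d
                    for r, d in zip(receiver["data"], donor["data"])
                ]
                donor["wmo"] = MISSING  # mark the donor for no further use
                break  # stop here so the same data cannot go to a second station
    return stations


# Station A is "close" to both B and C, but may only donate to one of them.
stations = [
    {"name": "A", "wmo": 100, "data": [1, MISSING]},
    {"name": "B", "wmo": 200, "data": [MISSING, MISSING]},
    {"name": "C", "wmo": 300, "data": [MISSING, MISSING]},
]
merged = merge_duplicates(
    stations, lambda a, b: "A" in (a["name"], b["name"])
)
```

In this toy case A fills B and is then excluded, so C keeps its missing values instead of receiving the same data a second time.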

There is no protection to prevent a chain of stations, all within 8km  
of their neighbours, from passing inherited data from one end to the  
other, a distance which could be well over 8km. Since no context  
checking is done on inherited data to ascertain suitability, this  
could result in inappropriate values being inserted into a station  
some distance from the originator. Problem Two, worrying but probably  
not critical.
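
The chain effect can be shown with a toy example. This is only a sketch under simplifying assumptions: stations sit on a straight line, positions are given directly in km, and each holds a single value. None of these names or numbers come from anomdtb.

```python
THRESHOLD_KM = 8.0
MISSING = -999


def propagate(positions_km, values):
    """Pass a value down a line of stations, neighbour to neighbour.

    Each station fills the next one if the two are within the threshold,
    with no check on how far the value has already travelled.
    """
    values = list(values)
    for i in range(len(values) - 1):
        close = abs(positions_km[i + 1] - positions_km[i]) < THRESHOLD_KM
        if close and values[i + 1] == MISSING and values[i] != MISSING:
            values[i + 1] = values[i]  # inherited data is passed on unchecked
    return values


# Four hypothetical stations, each 6km from the next.
positions = [0.0, 6.0, 12.0, 18.0]
filled = propagate(positions, [15.2, MISSING, MISSING, MISSING])
```

Every hop is under 8km, yet the last station ends up with a value that originated 18km away.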

Now the killer. The 'duplication' test calls a routine with two pairs  
of lat/lon values and gets back an approximate Great Circle distance  
between them, in km. If this figure is below the threshold (set at  
8km) then the process is initiated. However, for reasons I have yet  
to fathom, the lats and lons are scaled by 0.01 when they are read  
into the arrays, and so most of the duplication incidents are false!  
For example, these two stations are flagged as duplicated and the  
first (Lugano) is excluded:

  67700   460    -90  273 LUGANO               SWITZERLAND   1864 2006 101864  -999.00
 160660   456    -87 -999 MILANO MALPENSA      ITALY         1961 1970 101961  -999.00

Yet they are over 50km apart! The faulty routine says the distance is  
5.4km because it sees lats of 4.56 and 4.60, and lons of -0.90 and  
-0.87. Problem Three, probably critical.
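
The effect of the scaling can be checked with a short Python sketch. It uses a standard haversine formula as a stand-in, not the program's own approximate routine, so the figures will differ slightly from the 5.4km quoted above. The scale of 0.1 (header values in tenths of a degree) is an assumption inferred from the station headers; the function names are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
THRESHOLD_KM = 8.0        # duplication threshold


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def station_distance_km(a, b, scale):
    """Distance between two (lat, lon) header pairs after applying a scale."""
    return great_circle_km(a[0] * scale, a[1] * scale,
                           b[0] * scale, b[1] * scale)


# Raw lat/lon values as they appear in the two station headers.
lugano = (460, -90)
malpensa = (456, -87)

wrong = station_distance_km(lugano, malpensa, 0.01)  # scaling as described
right = station_distance_km(lugano, malpensa, 0.1)   # assumed correct scaling

flagged_with_wrong_scale = wrong < THRESHOLD_KM
flagged_with_right_scale = right < THRESHOLD_KM
```

With the 0.01 scaling the pair comes out a little over 5km apart and is flagged; with 0.1 it is just over 50km and is not.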

I'll make the necessary adjustments. I think the problem is the read  
routine, and how it decides which scaling factor to use - so it's  
possible CRU TS 2.1 escaped if it used a different data set style.

Just thought you ought to know.

Cheers

Harry
Ian "Harry" Harris
Climatic Research Unit
School of Environmental Sciences
University of East Anglia
Norwich NR4 7TJ
United Kingdom


</x-flowed>
