Bug #441
open

Added by Anonymous about 14 years ago.
Updated about 14 years ago.

Description
  
  This is a feature request, not a bug.
	I have seen a large number of F3200 "suspecting replicated data" warnings, most of them spurious. HDH eliminated about 90% of the unnecessary F3200 warnings by providing an option to issue no F3200s for 0-D variables. Even so, in my experience the majority of F3200s have been spurious.
	The F3200 exception is based on comparing a few indicators, such as the mean and maximum. For my own use, I wrote (in Python) a filter that reads the suspect data a second time and compares the entire data arrays, item by item. For the comparison I used the method favored by numerical analysts and implemented by the function numpy.ma.allclose(): scalars a and b are close if |a-b| <= atol + rtol*|b|. My default tolerances are atol=1.e-9 and rtol=1.e-6, but any small numbers would work. So far I have sent hundreds of data sets through this filter, and now I have zero spurious warnings.
	I think that a better place for this second-pass check would be the QC tool.
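	A minimal sketch of such a second-pass filter, assuming the two suspect records can be re-read into masked arrays with the netCDF4 library (the file path, variable name, and record indices are hypothetical placeholders):

    import netCDF4
    import numpy.ma as ma

    def records_really_replicated(path, varname, t0, t1, atol=1.e-9, rtol=1.e-6):
        """Re-read two suspect time records and compare them item by item."""
        with netCDF4.Dataset(path) as nc:
            rec0 = nc.variables[varname][t0]
            rec1 = nc.variables[varname][t1]
        # allclose applies |a-b| <= atol + rtol*|b| element-wise and, by
        # default, treats cells that are masked in both records as equal
        return ma.allclose(rec0, rec1, atol=atol, rtol=rtol)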
       - Status changed from New to In Progress
The approach used in the QC to detect replicated records is not robust in an absolute sense, I agree.
Even two almost identical records that differ only in two swapped grid-cell values would be falsely flagged as replicated.
The reason for implementing it this way was firstly to be quick and secondly to work across different sub-temporal files.
The hope was that global fluctuations would be large enough to prevent false alarms. Unfortunately, there are variables with
only a few valid grid-cell values, e.g. some kinds of snow/ice properties.
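	As a toy illustration of that weakness (hypothetical arrays, not QC code): swapping two grid-cell values leaves the mean and the maximum unchanged, so an indicator-based comparison sees the records as identical although they differ element-wise.

    import numpy as np

    rec_a = np.array([0.1, 0.2, 0.3, 0.4])
    rec_b = np.array([0.2, 0.1, 0.3, 0.4])  # same values, two cells swapped

    print(rec_a.mean() == rec_b.mean(), rec_a.max() == rec_b.max())  # True True
    print(np.array_equal(rec_a, rec_b))                              # False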
	The numpy.ma.allclose algorithm is robust, and so are others. But it does not work across different sub-temporal files, which are quite common; the QC does not see all sub-temporal files, only the one it is currently checking. Additionally, opening all previous records for every current record would take far too much time.
	I would like to propose calculating a checksum (MD5) of each record and comparing these. The checksums of previous sub-temporal files would additionally be stored in the qc<filename>.nc result file.
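	A sketch of how such per-record checksums might be computed, assuming MD5 over the raw bytes of each time record (using netCDF4 and hashlib; the variable name is a placeholder):

    import hashlib
    import netCDF4
    import numpy as np

    def record_checksums(path, varname):
        """Return one MD5 hex digest per time record of a variable."""
        digests = []
        with netCDF4.Dataset(path) as nc:
            var = nc.variables[varname]
            for t in range(var.shape[0]):
                rec = var[t]
                if np.ma.isMaskedArray(rec):
                    rec = rec.filled(0.0)  # make masked cells deterministic
                data = np.ascontiguousarray(rec)
                digests.append(hashlib.md5(data.tobytes()).hexdigest())
        return digests

	Comparing records across sub-temporal files would then reduce to comparing stored digest strings, with no need to reopen earlier files.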
Two comments:
	The existing replication check works well as a first pass.  I have found the detailed direct check useful only as a second pass, for removing spurious results.  I agree that it would be impossibly slow to do a detailed comparison of all pairs of records!  I was only suggesting the use of a detailed check as the second pass of a two-pass algorithm.
	Checksums are a good idea. The tricky part is floating-point numbers, where bit-for-bit agreement may not be the only relevant kind of identity. Maybe some kind of rounding could be part of a checksum-based algorithm.
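	One way the suggested rounding could enter a checksum-based algorithm (a sketch under that assumption, not a settled design): quantize the values to a fixed number of decimals before hashing, so records that agree within that tolerance map to the same digest. Values straddling a rounding boundary can still hash differently, so this is a heuristic rather than an exact tolerance test.

    import hashlib
    import numpy as np

    def rounded_checksum(rec, decimals=6):
        """MD5 of a record after rounding away tiny bit-level differences."""
        data = np.round(np.asarray(rec, dtype=np.float64), decimals=decimals)
        return hashlib.md5(np.ascontiguousarray(data).tobytes()).hexdigest()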