Gene expression profiling using microarrays plays an important role in many areas of biology. Microarray data, however, often contain many missing values. Among the most commonly used computational analysis tools that require imputation of missing values are dimensionality-reduction algorithms such as principal component analysis (PCA) and singular value decomposition, and machine learning algorithms such as support vector machines. Advanced imputation methods have therefore been developed, such as KNNimpute [1], Bayesian PCA [2] and LLSimpute [3], all of which are based on correlations between available measurements in the data matrix (samples × reporters). In KNNimpute, for example, a weighted average of the *K* most similar genes (defined by Euclidean distance) is used to derive an estimate of a missing value in the gene of interest [1]. Missing values and the choice of imputation method have also been shown to affect the significance analysis of differentially expressed genes [4, 5].
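The KNNimpute idea described above can be sketched in a few lines. This is a minimal illustration and not the implementation of [1]: the function name `knn_impute`, the genes-as-rows layout, the use of `None` for missing entries, the normalisation of the distance by the number of shared samples, and the inverse-distance weights are all assumptions made for the example.

```python
import math

def knn_impute(matrix, gene, sample, k=2):
    """Estimate matrix[gene][sample] from the k most similar genes.

    matrix is a list of rows (one row per gene, an assumed layout);
    missing entries are represented by None.
    """
    target = matrix[gene]
    candidates = []
    for g, row in enumerate(matrix):
        # A neighbour must be a different gene with a measured value
        # in the sample we want to impute.
        if g == gene or row[sample] is None:
            continue
        # Euclidean distance over samples measured in both genes,
        # normalised by the number of shared samples (assumption).
        shared = [(a, b) for a, b in zip(target, row)
                  if a is not None and b is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
        candidates.append((dist, row[sample]))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    if not nearest:
        return None  # nothing to impute from
    eps = 1e-9  # avoids division by zero for identical profiles
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(w * v for w, (_, v) in zip(weights, nearest))
    return total / sum(weights)
```

Closer neighbours contribute more to the estimate; with `k=1` the sketch simply copies the value from the single most similar gene.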

Missing values can occur due to dust or scratches on the slide, spotting problems or hybridization problems. Obviously problematic spots are manually flagged as missing and are removed from further analysis. It is customary to subtract background intensities from the spot intensity, and this also produces missing values, because the logarithm of a negative intensity is undefined. Negative background-corrected intensities arise if the spot intensity is comparable to the background intensity, owing to contamination of the spot, leakage from neighbouring spots, or low abundance of dyed cDNA in the reference or sample.
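How background subtraction turns a measured spot into a missing value can be illustrated as follows. This is a hedged sketch: the function name `log_ratio`, its argument names, the choice of base-2 logarithms, and the treatment of a zero net intensity as missing are assumptions made for the example.

```python
import math

def log_ratio(sample_fg, sample_bg, ref_fg, ref_bg):
    """Return the log2 ratio of background-corrected intensities,
    or None when the value has to be treated as missing."""
    sample_net = sample_fg - sample_bg  # background-corrected sample channel
    ref_net = ref_fg - ref_bg           # background-corrected reference channel
    # A non-positive net intensity in either channel leaves the
    # log ratio undefined, so the spot becomes a missing value.
    if sample_net <= 0 or ref_net <= 0:
        return None
    return math.log2(sample_net / ref_net)
```

A spot whose foreground barely exceeds its background can thus flip between a large negative log ratio and a missing value on a small change in intensity.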

The choice of spot quality assessment, and of the transform applied to the resulting intensities, will influence the final result of imputation as well as the analysis as a whole. Error models and transforms other than the logarithm have been developed, designed to give reliable variance estimates in the absence of replicates [6, 7] or to generate a well-defined distribution of expression values [8].

However, more heuristic approaches remain in use. One reason could be that many transforms or weights require careful tuning of parameters, which tend to be platform dependent [7], and perhaps even dataset specific. Results from experiments designed to determine such parameters are not always available. Another reason is that the biological variation within traits tends to dominate over technical variation. When the study is large enough to get a reliable sample estimate of the total variation within traits, there is less need for information on technical variation alone. Furthermore, instead of relying on a specific distribution of expression values, *p*-values and false discovery rates [9] are often obtained by non-parametric tests [10] or empirical methods such as permutation tests [11, 12].

In heuristic analyses, some spot quality control is still performed, often in terms of threshold values in observables such as spot size, intensity, background variation, or combinations thereof, which are used to flag spots as missing. An undesired feature of this approach is the sharp threshold effect it produces. A spot with an observable, say reference intensity, just below the chosen threshold will be deemed "completely unreliable", while a spot with essentially the same intensity, but just above the threshold, will be considered "perfectly reliable".

Different smoothings of threshold effects have therefore been developed [7, 13–15]. Smoothing introduces continuous weights, ranging from 0 for missing or "completely unreliable" measurements, to 1 for "perfectly reliable" ones, but also taking on values in between. The chosen weight is related to some commonly used threshold observables and threshold values, typically tuned to be about 1/2 at the otherwise adopted threshold.
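The contrast between a hard threshold filter and a smoothed weight can be sketched as below. The functional form is an assumption chosen only because it equals 1/2 at the threshold and stays within [0, 1]; it is not claimed to be the form used in [7, 13–15]. The names `hard_weight`, `smooth_weight` and the parameter `steepness` are hypothetical.

```python
def hard_weight(intensity, threshold):
    """Threshold filter: a spot is either kept (1) or flagged as missing (0)."""
    return 1.0 if intensity >= threshold else 0.0

def smooth_weight(intensity, threshold, steepness=2.0):
    """Continuous weight in [0, 1] that equals 1/2 at the threshold.

    Illustrative form (assumption): w = x^n / (x^n + t^n), with
    x the intensity, t the threshold and n the steepness.
    """
    if intensity <= 0:
        return 0.0  # depleted spots are treated as missing
    x_n = intensity ** steepness
    return x_n / (x_n + threshold ** steepness)
```

With a threshold of 100, intensities of 99 and 101 receive hard weights of 0 and 1, whereas their smooth weights both stay close to 1/2.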

Weights *w* associated with the expression values *x* (with *w* = 0 for missing values) can be used to improve imputation [16]. For every spot, measured as well as missing, an imputed value *x*_{imp} is calculated and an adjusted value

*x'* = *wx* + (1 - *w*)*x*_{imp} (1)

is used in the subsequent analysis. Thus, missing and "completely unreliable" values *x* are replaced by *x*_{imp}, "perfectly reliable" measurements *x* are kept, and spots with weights between 0 and 1 will end up with an expression value somewhere between the imputed and the directly measured value.
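Eq. 1 translates directly into code. The sketch below is illustrative; the function name `adjusted_value` and the use of `None` for a missing measurement are assumptions.

```python
def adjusted_value(x, w, x_imp):
    """Blend a measured value x with its imputed value x_imp (eq. 1).

    w is the spot weight in [0, 1]; x may be None when the spot is
    missing, in which case w must be 0.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    if w == 0.0:
        return x_imp  # missing or completely unreliable: use the imputed value
    return w * x + (1.0 - w) * x_imp
```

For example, a spot measured at 2.0 with weight 0.5 and imputed value 1.0 receives the adjusted value 1.5, halfway between the two.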

Weighted imputation requires a weight definition which ranges from 0 to 1. This property is used in eq. 1 and in the selection of the number of nearest neighbours [16]. The range constraint on the weights excludes weighting schemes that combine an error model estimate of the variance *σ*^{2} with a weight motivated by maximum likelihood, *w* = 1/*σ*^{2}. Instead, weights representing a smoothing of an otherwise adopted threshold filter satisfy the range constraint.

For 2-dye cDNA microarrays, it is common to impute missing values using the data matrix of log intensity ratios. In this approach, no information as to why a measurement is missing is included. It is also possible to impute intensities for each channel separately and form the log ratio from them [5]. Different forms of missing values are then handled differently, but the imputation of a missing intensity is performed without use of the information provided by the other channel of the same spot.

We divided missing values of 2-dye cDNA data into three categories: those that are missing due to a missing sample intensity only (sample depleted spots), those missing due to a missing reference intensity only (reference depleted spots), and those missing for other reasons. We examined whether this categorization can be used to improve imputation of expression values. We wanted to investigate imputation of one-channel depleted spots using the best imputation scheme available. We therefore worked with the weighted version presented in [16] and described in our methods section.
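The three categories of missing values can be expressed as a simple classification rule. This is a sketch under stated assumptions: a channel is taken to be "missing" when its background-corrected intensity is absent or non-positive, and the function name `categorize_spot`, the `flagged` argument and the returned labels are hypothetical.

```python
def categorize_spot(sample_net, ref_net, flagged=False):
    """Classify a spot from its background-corrected channel intensities.

    Returns "measured", "sample depleted", "reference depleted" or "other".
    """
    if flagged:
        return "other"  # e.g. manually flagged because of dust or scratches
    sample_ok = sample_net is not None and sample_net > 0
    ref_ok = ref_net is not None and ref_net > 0
    if sample_ok and ref_ok:
        return "measured"
    if ref_ok:
        return "sample depleted"     # only the sample channel is missing
    if sample_ok:
        return "reference depleted"  # only the reference channel is missing
    return "other"                   # both channels are missing
```

Only the two one-channel categories retain usable information from the other channel of the same spot, which is what distinguishes them from the remaining missing values.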