Description of the DNA microarray and its use
All laboratory protocols have been described in detail by Witney et al. [16] and are registered at BμG@Sbase http://bugs.sgul.ac.uk/E-BUGS-30.
The S. aureus DNA microarray used in this study, which consists of PCR-based probes for all open reading frames (ORFs) of seven S. aureus strains, has been described and validated previously and is summarised here [16]. In short, all ORFs of MRSA252, which served as base strain, were added to the microarray design followed by the addition of probes for genes from the other strains that are absent in, or show significant divergence from the genes of MRSA252 based on BLAST bit scores. The order in which the probes for the genes of the strains were added to the array was: MRSA252 (base strain), N315, Mu50, COL, NCTC8325, MW2, and MSSA476 [16]. In total, the microarray consisted of 3623 PCR products spotted in duplicate representing every predicted open reading frame of the seven strains [8, 16]. Around 75% of the PCR products (n = 5478 in duplo) represent MRSA252, while around 25% of the PCR products (n = 1768 in duplo) were obtained from the other six strains.
All strains were cultured on tryptic soy agar sheep blood plates at 37°C overnight. DNA of the reference strain and the test strains was isolated using the QIAGEN genomic-tip 100/G column and an Edge Biosystems Bactererial Genomic DNA purification kit (Edge Biosystems, Gateshead, United Kingdom).
DNA of all seven sequenced S. aureus strains, labelled with Cy3, was hybridised in duplicate on an array, with DNA of MRSA252, which was labelled with Cy5 as reference signal. Labeling was performed as described previously [8, 16].
Microarray images were quantified with ImaGene software (Biodiscovery, http://www.biodiscovery.com, El Segundo, California, United States). The two pictures per slide, one for every dye, were analysed separately in ImaGene. Fully annotated microarray data are deposited in BμG@Sbase http://bugs.sgul.ac.uk/E-BUGS-30 and have been retrieved for this study.
The background was calculated automatically and separately per colour and spot by ImaGene software.
Data-processing
Since a reference signal was available for 75% of the microarray spots, only for this part hybridisation data could be analysed using ratios of test, Cy3, and reference, Cy5, signal. For this purpose the data were dissected into two different data sets. The first data set consisted of the hybridisation results of the MRSA252 specific spots (75% of the spots in duplicate) containing a reference signal (further indicated as MRSA252 spots). Hybridisation signals for these MRSA252 spots were analysed using GACK (further referred to as GACK method), which is a well-documented standard analysis method [4, 17, 18] and also the new analysis methods developed in this study. The second data set that includes all spots, i.e. both the non-MRSA252 spots that lack a reference signal and the MRSA252 spots, were only analysed using the new developed methods. The MRSA252 spots were also analysed with the new analysis methods to compare the outcome of the methods with that of the GACK method.
In a first step only the MRSA252 spots on all slides were analysed. For every array the bad spots were filtered. Filtering was applied to exclude spots for which the reference, Cy5, signal was less than two times the background value of that particular spot. Also the spots, which ImaGene automatically defines as bad or empty spots, were excluded from the data. Slides were then normalised per array to correct for differences in labelling-efficiency, hybridisation, scanning conditions, and slide quality. For every particular spot the intensity of Cy3 minus the Cy3 background signal and Cy5 minus the Cy5 background signal was calculated, after which Cy3 to Cy5 ratios were calculated and log2-transformed. Per array the median and standard deviation of all ratios were calculated. The arrays were then subsequently normalised per spot, by first subtracting the median ratio of that particular array followed by dividing the resulting ratios per spot by the specific standard deviation of all ratios for that array. This specific normalisation is also called auto-scale normalisation. The estimated probability of presence of each gene or DNA fragment was determined using a GACK-transformation http://falkow.stanford.edu/whatwedo/software/software.html[19]. Using GACK transformation, it is possible to dynamically choose cut-offs for determining presence or divergence of genes or DNA fragments based on the shape of the distribution. For this study data were transformed to binary output using a threshold of 50% estimated probability of being present (EPP) in order to call genes present or divergent. To use GACK the data need to be normally distributed, which was the case in this study (Figure 1).
To analyse the second data set, which include the non-MRSA252 as well as the MRSA252 spots, all slides were again normalised separately. This means that first the intensities of Cy3 minus the Cy3 background signal for all spots and Cy5 minus the Cy5 background Cy5 signal for the MRSA252 spots were calculated. Because of the lack of a reference signal, the non-MRSA252 spots could not be normalised in the ratio dependent way or by auto-scale normalisation since the latter will reduce differences in the intensities, which compromises calculation of an accurate cut-off for gene presence or divergence. For this reason the second data set was analysed using different newly developed methods (see below). Filtering of the MRSA252 spots in this data set was performed identical as in the first data set; by excluding the automatically flagged spots by ImaGene and spots with a Cy5 signal intensity, which is less than two times the Cy5 background signal.
The results of the new analysis methods were validated by comparing the hybridisation data of the seven sequenced strains used in the array with the predicted presence and divergence of genes based on the GACK method and on the genome sequence data. Performance of the new method was determined by calculating sensitivity, specificity, PPV and NPV of the new methods. Sensitivity is calculated by dividing the number of genes predicted to be present in the control strains based on the new analysis method, by the number of genes that are considered to be present based on the annotation of Witney et al [16]. The calculated sensitivity of an analysis method was only based on the extra spots for each strain that were added to the array. For MRSA252 this means around 75% of the total of all spots on the array, since this strain was added first on the array. So, if for instance only 148 of in total 176 NCTC8325-specific sequences were identified as being present by an analysis method, the sensitivity of that analysis method would be 84.09%.
Specificity is the proportion of the divergent sequences, which are correctly identified as divergent sequences in an analysis method. This means that for, e.g., strain NCTC8325 all 170 probes spotted extra for strains MW2 and MSSA476 should be divergent in the NCTC8325 genes, since MW2 and MSSA476 were added after NCTC8325. If, however, in this case only 119 sequences were called divergent by an analysis method, the specificity would be 70%.
The positive predictive value (PPV) is the number of the true positive sequences divided by the total number of sequences that were indicated as positive (true and false-positive) in the different analysis methods. This means that if 104 of the 5484 probes that are called present for MRSA252 by an analysis method are false-positive, the PPV would be 98.10% (5380/5484)
The negative predictive value (NPV) is the number true divergent sequences divided by the total number of sequences indicated as divergent. So, in the case that only 902 of 922 sequences called divergent by an analysis method are true divergent, the NPV would be 97.83% (902/922).
For the MRSA252 spots, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of the new methods were calculated and compared with the values obtained with the GACK method.
In total six different approaches were used to analyse the data. In the first method a cut-off per array and in the other five methods a cut-off per spot was calculated.
Cut-off per array
To calculate the cut-off per array, log2 values of the intensities of the reference MRSA252, Cy5 signal, on the MRSA252 and the non-MRSA252 spots were determined. The cut-off is that intensity value where 95% of the MRSA252 spots will be called positive and 95% of the non-MRSA252 spots negative. Only the intensities of the MRSA252 reference signal were included because for this strain it is exactly known which sequences should be present and which diverge; all MRSA252 probes have to be present and all non-MRSA252 probes have to be divergent.
Cut-off per spot
Five different approaches were used to determine the cut-off value per spot for presence and divergence of the sequences, which are explained below. All approaches analysed the hybridisation signals for each spot separately and generated a cut-off per spot to determine whether a sequence was conserved or divergent. With the consequence that intensity above the cut-off means that a gene is present. Using these cut-offs, sensitivities and specificities, PPV and NPV were calculated as described above.
1. Cut-off based on two times the background
In the first approach non-MRSA252 sequences were considered present when the value of the Cy3 intensity of a particular spot was higher than twice the background signal of that spot. In short: gene is present when Cy3 intensity > 2 × Cy3 background signal.
2. Cut-off based on reference signal intensities
In the second approach the non-MRSA252 sequences were considered present when the value of the Cy3 intensity of a particular spot was higher than the Cy5 signal of that spot. Cy5 signals for non-MRSA252 spots are considered background noise in this approach, caused by, among other things, cross-hybridisation. So, in this second method, we hypothesized that signal intensity of the test strains for conserved sequences has to be at least above the value of the reference signal. In short: A gene is present when Cy3 intensity > Cy5 intensity.
3. Cut-off based on the minimal ratio of positivity
The first step in this third approach is to determine per array the lowest test/reference (Cy3/Cy5) intensity ratio for the MRSA252 spots, thus for spots for which a reference signal is available, and which were predicted to be positive based on GACK analysis. Subsequently the reference intensity of every individual spot was multiplied with the value of this lowest Cy3/Cy5 ratio. This way the lowest Cy3 value per spot predicted to be present was calculated. The last step in the calculation is that Cy3 intensities, which were greater than the Cy5 value of that spot multiplied by the lowest Cy3/Cy5 ratio, were considered to represent sequence presence. In short: A gene is present when Cy3 intensity > minimal ratio × Cy5 intensity.
4. Cut-off based on two times the background and reference signal intensities
This approach consists of a combined calculation of the first and second approach (see above). In this method sequences were considered present when the value of the Cy3 intensity of a particular spot was higher than twice the background signal and when the Cy3 intensity was higher than the Cy5 intensity of that spot. In short: A gene is present when Cy3 intensity > (2 × Cy3 background signal) and > Cy5 intensity.
5. Cut-off based on two times the background and the minimal ratio of positivity
This last approach is a combined calculation of the first and the third approach. The first step in the last approach is to determine per array the lowest Cy3/Cy5 ratio for MRSA252 spots, as described above. Cy3 intensities, that were twice the background and greater than the Cy5 value of that spot multiplied by the lowest Cy3/Cy5 ratio were considered to represent sequences presence. In short: A gene is present when Cy3 intensity > (2 × Cy3 background signal) and > (minimal ratio × Cy5 intensity).