Algorithm
The most fundamental change from the original algorithm is the introduction of background correction based on the median signal of antigenomic probes having the same GC-content as the given PM probe, rather than relying on the cognate MM probe hybridization signal. The GCSscore algorithm is based upon the operations in eqs. (1) through (3). For a given probe grouping method (as defined below), k, which is made up of N probes, the GCS-score, denoted as GCSsk, is as follows:
$$ {GCSs}_k=\sum \limits_{i=1}^N\frac{l_{iB}-{l}_{iA}}{\varepsilon_I\sqrt{N}} $$
(1)
$$ {\varepsilon}_i=\sqrt{\gamma^2\left({l}_{iB}^2-{l}_{iA}^2\right)+{SDT}_A^2+{SDT}_B^2} $$
(2)
$$ SDT=4\ast raw\mathrm{Q}\ast SF $$
(3)
In the algorithm equations, liA and liB, represent the background corrected intensities of the i-th probe pair from array A and B, respectively. As defined in the Affymetrix MicroArray Suite (MAS) documentation, the significant difference threshold (SDT) is determined by the noise floor of each array and the chosen scaling factor. The noise floor (rawQ) is calculated from the standard deviation of the bottom 2% of the probe intensities across the array. The scaling factor (SF) for each array is a multiplier that scales the median intensities to a target value (default is 500). The gamma factor is set to 0.1 to prevent calculated GCS-score values from being affected by gene expression levels [2].
The GCSscore package imports functions from the following CRAN/BioConductor packages: BiocManager, Biobase, utils, methods, RSQLite, devtools, dplR, stringr, graphics, stats, affxparser, and data.table. If it is desirable to pull datasets for GEO or perform the downstream analysis presented in this publication, the following additional packages are necessary: siggenes, GEOquery, and R.utils. All probe-level data and annotations utilized by the GCSscore package are parsed and sourced directly from the following chip specific BioConductor packages: platform design (.pd) and AnnotationDbi (.db). The resulting probe-level data file is packaged into a ‘probefile’ package, while the annotations are packaged into an additional ‘annot’ package. These packages are created on the fly and installed in the user’s library by utilizing customized versions of the makeProbePackage function and package templates sourced from the AnnotationForge package [16].
While the theory of the GCSscore method can be applied to any modern Affymetrix/Thermo-Fisher chip type, the R package was written for use with Clariom-style array, which include: ClariomS, ClariomD, and XTA assay chip types. This package fully supports all Clariom-style arrays and has support for two of the most widely used 3′ IVT arrays: Mouse Genome 430 2.0 and Human Genome UG133 2.0. For older types of Affymetrix arrays, the original sscore package must be used. The GCSscore algorithm allows the user to calculate GCS-score values for ClariomD/XTA arrays using two probe grouping methods: (1) utilizes TCids groupings for gene-level, (2) utilizes PSRids and JUCids for exon & alternate splicing-level. Since the ClariomS arrays only contain TCids, there is only a gene-level analysis method. Additionally, for supported 3′ IVT chips, the method refers to two background subtraction options: (1) utilizes the new GC-bkg method (2) utilizes the original PM-MM method.
The GCSscore package allows for direct probe-level comparisons of two Affymetrix microarrays at a time. The user can either input two. CEL files directly into the function, or read in a formatted batch file that is setup to run pair-wise comparisons of multiple. CEL files in a single function call (Additional File 1: Table S1). For more information regarding the implementation, please refer to the workflow diagram (Additional File 1: Figure S1). In brief:. CEL files are scaled to have equivalent trimmed median intensities for the desired probe grouping method. The GCS-scores are calculated and normalized using the middle 98% of the raw scores. Finally, normalized results are combined with the annotation information parsed from the Bioconductor repository and are returned to the user’s environment using the Biobase data structure, ExpressionSet. The user can also choose to save the GCS-score results to disk, as a. CSV file.
Statistical properties and analysis of GCS-scores
One principal advantage of the GCSscore based method is the simple Gaussian-like statistics of the resulting output (Additional File 1: Figure S2). If no extreme differential expression exists between two. CEL files then the GCSscore output will have a mean of 0 and a SD of 1 [2]. Since, each run is essentially z-scored and normalized prior to output, each GCS-score becomes a representation of the standard deviation from the mean of a Gaussian-like distribution. Thus, the absolute values of GCS-scores greater than 1.8–2.0 are likely to be statistically significant and this can be determined by using statistical testing of biological replicates with correction for multiple testing, as done with the SAM method below.
Workflow for generating differential expression for downstream analysis
In our standard implementation, all treatment samples were run against all control samples in a pairwise fashion. For example, if there are 3 replicates for the treatment and 3 replicates for the control group, there will be 9 total pairwise comparisons (Additional File 1: Table S1). The GCS-scores were averaged for each treatment sample against all 3 control samples, producing 3 averaged GCS-score values, one for each of the treatment samples. This was done to reduce noise with small sample sizes and to prevent over inflation of sample numbers that would occur from taking all of the pairwise comparisons into account [17]. Alternatively, random pairings of treatment/control samples can be used for generation of GCS-scores [5]. Multiple-testing correction of the GCS-score differential expression analysis can then be applied. For the statistical analyses presented in this publication, the averages of each treatment replicate against all of the control samples were used as the input into a 1-class Significance Analysis of Microarrays (SAM) analysis for multiple testing correction in identifying genes with GCS-score values statistically different from 0, as demonstrated in prior publications with the original Sscore algorithm [5, 9, 17]. The SAM algorithm used here was provided by the siggenes package from Bioconductor. More complex experimental designs implement multiple group testing in SAM or other appropriate statistical methods, such as LIMMA [18]. The average of these treatment replicate averages, denoted as AvgSs, was used as an additional stringent filter to decrease contributions from genes with exceedingly small fold-changes, as reported previously [5, 9, 17]. To determine significantly regulated gene lists, the following criteria were thus used: genes from the SAM output within a determined FDR cutoff (e.g. 0.0125–0.1) and genes who also have |AvgSs| > 1.8. In the Clariom-based array cases explored in parts B and C of the Results section, downstream analysis of the gene lists was performed using ToppFun suite for Gene Ontology/Functional enrichment analysis and Ingenuity Pathway Analysis (IPA) to find significantly altered signaling pathways. For the ToppFun analysis, only the gene symbols from the generated gene lists were input directly into the suite. For the IPA analysis, the TCids, AvgSs, and rawp values from the SAM output were used as the input.