Study population
The study population consisted of women enrolled in an ongoing study of cervical neoplasia in high-risk urban women [10]. Participants were recruited from non-pregnant, HIV-negative women, aged 18–69 years, attending colposcopy clinics at urban public hospitals in Atlanta, Georgia and Detroit, Michigan. Specimens used in this study were from participants enrolled between December 2000 and December 2002. Cervical disease status was determined from the summary results of cytology, colposcopy and biopsy examination. We selected 15 women with high-grade lesions (CIN3) as cases and, as controls, 15 women matched for age (± 4 years) and race who had either no lesions (7 CIN0) or only low-grade lesions (8 CIN1).
Sample collection and RNA extraction
After visualization of the cervix, ecto- and endocervical cells were collected using a CytoBroom (Cytyc Corporation, Marlborough, MA) and dislodged into PreservCyt collection media (Cytyc Corporation). If a cytology diagnosis was required, the collection device was first used to prepare a conventional Pap smear and then placed into the PreservCyt collection media. Samples were transported to the laboratory at ambient temperature and stored at 4°C until processed. Within two weeks of sample collection, total nucleic acids (TNA) were extracted from 14 ml of each 20 ml PreservCyt sample using modifications of the MasterPure Complete DNA and RNA Purification kit (Epicentre, Madison, WI) as previously described (Habis et al. 2004). The TNA extract was resuspended in 50 μL TE buffer with 50 units of RNasin (Promega Corporation, Madison, WI) and stored at -70°C until use. Total RNA derived from normal uterine cervix tissue (donor age 48, unknown ethnicity) was purchased (Stratagene®, La Jolla, CA). The quality of all samples was evaluated visually by gel electrophoresis, and RNA was quantified by densitometric measurement (FluorChem® Digital Imaging System, Alpha Innotech, Inc., San Leandro, CA) of the ribosomal bands, with comparison to a standard 28S and 18S control marker.
Microarray assays
We used MWG Human 30 k Arrays (A/B/C) (MWG Biotech, Ebersberg, Germany). Each array was hybridized with cDNA prepared from 500 ng total RNA. Conditions for labeling and hybridization were as described elsewhere [11]. Briefly, samples were pretreated with DNase I, and cDNA was prepared and labeled with biotin-11-dUTP (Enzo, Farmingdale, NY) using the SuperScript™ First-Strand Synthesis System for RT-PCR (Invitrogen, Carlsbad, CA) with oligo dT and random primers. The automated Discovery™ System (Ventana Medical Systems, Tucson, AZ) was used to hybridize slides for 8 hours at 42°C and to detect hybridization with anti-biotin Gold Resonance Light Scattering (RLS) Particles (Invitrogen). Slides were scanned with the GSD-501™ RLS scanner (Invitrogen), and 16-bit TIFF images were subsequently quantified with Array Vision™ Software 8.0 (Imaging Research, St. Catharines, ON, Canada). For statistical analysis we used the sARM value of each feature (artifact-removed density minus the background density). As the cut-off for detection of gene expression we used a signal-to-noise ratio (S/N = sARM divided by the SD of the background density) of 1.5.
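The detection rule above can be written out as a short sketch. The function names are hypothetical and are not part of Array Vision; the sketch also assumes that a feature counts as detected when its S/N is strictly above the cut-off, which matches the "S/N above 1.5" wording used later in the filtering step but is not stated explicitly here.

```python
def signal_to_noise(sarm, background_sd):
    """Return S/N = sARM divided by the SD of the background density."""
    if background_sd <= 0:
        raise ValueError("background SD must be positive")
    return sarm / background_sd


def is_detected(sarm, background_sd, cutoff=1.5):
    """Call a feature detected when its S/N exceeds the cut-off.

    Assumption: strict inequality (S/N > cutoff).
    """
    return signal_to_noise(sarm, background_sd) > cutoff


# Illustrative values only, not data from the study.
strong_feature = is_detected(sarm=300.0, background_sd=100.0)  # S/N = 3.0
weak_feature = is_detected(sarm=120.0, background_sd=100.0)    # S/N = 1.2
```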
Comparison of gene expression in cervical exfoliated cells and uterine cervix tissue
We used the results of the 7 samples of exfoliated cells from women with no abnormalities (CIN0) to characterize the profile of exfoliated cells, and the results of the uterine cervix tissue RNA, assayed in duplicate, to characterize the tissue profile. A gene was included in the exfoliated cell profile if it was detected in >85% of the exfoliated samples (at least 6 of 7), and in the tissue profile if it was detected in both replicates of the tissue sample. We used the web-based Database for Annotation, Visualization and Integrated Discovery (DAVID) http://david.niaid.nih.gov/David/ [12] for functional annotation and ontology of the detected genes, and the Expression Analysis Systematic Explorer (EASE) to identify enriched biological themes within the gene lists, defined by an EASE score of < 0.025 [13].
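As a minimal sketch of how the two profiles could be built and compared, the example below applies the inclusion thresholds (at least 6 of 7 exfoliated samples, both tissue replicates) to per-sample detection flags. The gene names, the flag values and the `build_profile` helper are all invented for illustration.

```python
def build_profile(detection, min_detected):
    """Return the set of genes detected in at least `min_detected` samples.

    detection: dict mapping gene name -> list of booleans, one per sample.
    """
    return {gene for gene, flags in detection.items()
            if sum(flags) >= min_detected}


# Hypothetical detection calls (True = S/N above the cut-off).
exfoliated = {
    "GENE_A": [True] * 7,
    "GENE_B": [True] * 5 + [False] * 2,
    "GENE_C": [True] * 6 + [False],
}
tissue = {
    "GENE_A": [True, True],
    "GENE_B": [True, True],
    "GENE_C": [True, False],
}

exfoliated_profile = build_profile(exfoliated, min_detected=6)
tissue_profile = build_profile(tissue, min_detected=2)

shared = exfoliated_profile & tissue_profile
exfoliated_only = exfoliated_profile - tissue_profile
tissue_only = tissue_profile - exfoliated_profile
```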
Differential gene expression in CIN0/CIN1 and CIN3 samples
Expression data were derived from the results of all 30 exfoliated samples hybridized to MWG A arrays. Features were restricted to those with an S/N above 1.5 in at least 12 of the 15 samples (80%) of either class (CIN0/CIN1 or CIN3). The 5461 genes that passed this filter were subjected to further statistical analysis.
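The filter can be illustrated with the following sketch, which keeps a feature when at least 12 of the 15 samples in either class exceed the S/N cut-off. The function names and the example S/N values are hypothetical.

```python
def count_detected(sn_values, cutoff=1.5):
    """Count samples whose S/N is above the cut-off."""
    return sum(1 for value in sn_values if value > cutoff)


def passes_filter(cin3_sn, control_sn, cutoff=1.5, min_detected=12):
    """Keep a feature detected in >= min_detected samples of either class."""
    return (count_detected(cin3_sn, cutoff) >= min_detected
            or count_detected(control_sn, cutoff) >= min_detected)


# Illustrative S/N values for one feature across 15 cases and 15 controls.
cin3_sn = [2.0] * 12 + [1.0] * 3      # detected in 12 of 15 cases
control_sn = [1.0] * 15               # detected in no controls
keep_feature = passes_filter(cin3_sn, control_sn)
```

Because the two class conditions are joined with "or", a gene expressed only in CIN3 samples (or only in controls) is retained for the comparison rather than discarded.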
We calculated the mean and coefficient of variation (CV) of the log2-transformed, median-centered sARM for each gene within the CIN3, CIN0 and CIN1/CIN0 groups. We used the CV in expression of each gene as a measure of homogeneity. For each gene we calculated the CV of the CIN3 group minus that of each of the other groups and plotted these differences in descending order to visualize discrepancies in the variation of gene expression between the classes.
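A minimal sketch of this calculation is given below. It rests on assumptions the text does not specify: the sample standard deviation is used, the CV is taken as SD divided by the (signed) mean, and genes with a mean of exactly zero are dropped because their CV is undefined. All function names and example values are illustrative.

```python
import math
import statistics


def log2_median_center(array_values):
    """Log2-transform the sARM values of one array and subtract their median."""
    logs = [math.log2(value) for value in array_values]
    median = statistics.median(logs)
    return [x - median for x in logs]


def cv(values):
    """Coefficient of variation: sample SD divided by the mean.

    Returns NaN when the mean is zero (CV undefined).
    """
    mean = statistics.mean(values)
    if mean == 0:
        return math.nan
    return statistics.stdev(values) / mean


def rank_by_cv_difference(cin3, other):
    """Return (gene, CV_CIN3 - CV_other) pairs sorted in descending order.

    cin3, other: dicts mapping gene -> list of normalized values per sample.
    """
    diffs = {}
    for gene in cin3:
        diff = cv(cin3[gene]) - cv(other[gene])
        if not math.isnan(diff):
            diffs[gene] = diff
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Illustrative normalized values, not study data.
cin3 = {"G1": [1.0, 2.0, 3.0], "G2": [2.0, 2.0, 2.0]}
other = {"G1": [2.0, 2.0, 2.0], "G2": [1.0, 2.0, 3.0]}
ranked = rank_by_cv_difference(cin3, other)
```

Note that the CV of median-centered log values becomes unstable when a gene's mean is close to zero, so the ranking of such genes should be read with caution.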
To identify expression differences between the two groups we used the following three different microarray-adapted statistical software packages:
(1) BRB Array Tools 3.2 http://linus.nci.nih.gov/BRB-ArrayTools: Log2-transformed sARM values were normalized over the median of each array and subjected to a two-sample t-test with a random variance model and 1000 permutations. We considered the 20 genes with the lowest univariate parametric p-values as differentially expressed. Multivariate permutation tests were applied to estimate the proportion of false discoveries in the discovery list. The indicated genes were assigned to gene ontology (GO) categories. Additionally, all GO categories that included at least 5 genes represented on the microarray were analyzed to identify biological themes overrepresented among the differentially expressed genes. A GO category was selected if its corresponding LS or KS permutation p-value was below the threshold of 0.005 [14].
(2) Significance Analysis of Microarrays (SAM) 1.21 http://www-stat.stanford.edu/~tibs/SAM/: sARM values were log2 transformed and centered to the median of each array. Normalized data were analyzed in an unpaired, two-class model with gene-specific t-tests and 200 permutations to estimate the false discovery rate (FDR) arising from multiple testing. Genes were scored based on change in expression relative to the standard deviation of the repeated measurements in order to identify those with differential expression [15].
(3) Focus 5.1 http://microarray.genetics.ucla.edu/focus/: Raw sARM data were normalized using a modified Z-transformation and tested against the hypothesis that gene expression in CIN3 samples is upregulated relative to controls. The applied contrast coefficient of 1.0 scored genes directly according to the average intensity difference between the two classes [16]. Genes agreeing best with the hypothesized pattern change were trimmed to a list of the 20 with the highest interest scores and at least a 2-fold change.
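To illustrate the general idea behind the permutation-based two-class comparisons listed above, the sketch below computes a permutation p-value for a single gene. It is a deliberately simplified stand-in: it uses a plain Welch t statistic and does not implement the random variance model of BRB Array Tools, the modified statistic and FDR estimation of SAM, or the interest score of Focus. The function names, the random seed and the example values are assumptions made for illustration.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Welch two-sample t statistic (unpaired, unequal variances)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0:
        return 0.0  # no variation in either group; treat as no evidence
    return (statistics.mean(a) - statistics.mean(b)) / se


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Two-sided p-value obtained by randomly permuting class labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(t_statistic(perm_a, perm_b)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Illustrative log2 expression values for one gene, not study data.
cin3_values = [5.1, 5.3, 4.9, 5.2, 5.0]
control_values = [1.0, 1.2, 0.9, 1.1, 1.3]
p_value = permutation_p_value(cin3_values, control_values)
```

In a genome-wide setting this test would be repeated for every gene that passed the detection filter, which is why the packages above also estimate the proportion of false discoveries.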