Reagents
Ammonium carbonate, ammonium bicarbonate, urea, formic acid, lysozyme, 2-Iodoethanol, and triethylphosphine were all purchased from Sigma-Aldrich (St. Louis, MO, USA). Acetonitrile and MS grade water were purchased from Honey Burdick & Jackson (Morristown, NJ, USA). Trypsin was purchased from Worthington Biochemical Corporation (Lakewood, NJ, USA). Seppro tip IgY-12 and reagent kit were purchased from GenWay Biotech (San Diego, CA, USA).
Human plasma samples
Plasma protein profiles were collected by the Hoosier Oncology Group (HOG) (Indianapolis, IN, USA) in two batches, which we refer to as Study II and III (each contained 40 plasma samples from women with breast cancer and 40 plasma samples from healthy age-matched volunteer women as control). Most of patients involved in the two studies were diagnosed with a stage II or III or earlier breast cancer. Most patients had previously been treated with chemotherapy. All samples were collected with the same standard operating procedure and stored in a central repository in Indianapolis, IN, USA. The demography and clinical distribution of breast cancer stages/subtypes for Study II and III are comparable (Additional file 1: Table S1). In Study II, there are 9 metastasis and 30 non-metastasis, 30 INV and 10 DCIS, mean tumor size 1.56, 8GI, 11 GII and 15 GIII. In Study III, there are 1 metastasis and 20 non-metastasis, 23 INV and 8 DCIS, mean tumor size 1.93, 3GI, 9GII and 18GIII.
Proteomics methods
Biomarker identification and characterization holds great promise for more precise diagnoses and for tailored therapies. The heterogeneity of human cancers and unmet medical needs in these diseases provides a compelling argument to focus biomarker development in cancer. Mass Spectrometry (MS)- based proteomics approaches have provided insight into biomarkers of cancer and other diseases with femtomole sensitivity and high analytical precision. We presented a four steps pipeline for the identification and validation of isoform-specific peptide biomarkers from breast cancer proteomics: Peptide Search Database Construction, Peptide Identification and Quantification, Statistical Identification of Isoform Markers, and Validation.
Peptide search database construction
A comprehensive database of human peptides characteristic of all known and theoretic protein isoforms was developed in three steps: 1) obtaining gene structures of all protein-coding genes in the human genome, 2) compiling in silico isoform junction peptides, and 3) validating those peptides in current protein knowledgebase.
First, we downloaded all information about human genes in the Ensemble [13]. We retrieved gene information such as name, position, exon phase, exon/intron coordinates, and annotation. Exons which overlap with each other were classified into a group, and a serial number was assigned to each group according to its order in the sequence. For instance, the first group in the gene would be marked as group one, and the second as group two, etc. Introns can be obtained by the sequence between two exons.
Then, we generated in silico Isoform Junction Peptides (IJP), which contains two types of peptides: the peptides translated from all exons and the ones that are virtually translated from all possible exon/intron junction regions. Four types of exon/intron sequence joining types are considered when generating IJPs: intron-exon (I_E_TH, left intron retention junction), exon-intron (E_I_TH, right intron retention junction), knowledgebase validated exon-exon (E_E_KB, exon-exon junctions which can be found in Ensembl transcripts) and theoretical exon-exon (E_E_TH, exon-exon junctions which cannot be found in Ensembl transcripts).
For those exons with the phase information in Ensemble transcript, we directly used the phase to translate the sequence. For those exons without the phase information in Ensemble transcript, we designed an artificial translation method as follows. This sequence is used to generate three peptides, each of which has a different opening reading frame (ORF) and a maximal length of 140 amino acid residues (longer than the longest possible peptide fragments directly obtained from a MS/MS spectrum). The three ORFs are estimated in a validation procedure, where the ORF will be discarded if a stop codon is found in exon, knowledgebase validated exon-exon, or theoretical intron-exon, or if a stop codon is found in the first exon in theoretical exon-exon or theoretical exon-intron.
In the third and final step, we validated each IJP in the ensemble transcript database. Those Ensemble predicted transcripts have been mapped by Ensemble to full-length or near-full-length protein sequence already available in the public sequence databases [13]. We labeled the IJP as knowledge based (_KB) if it can be matched as a substring of any ensemble transcript of the same ensemble gene; otherwise, as theoretic (_TH).
Peptide identification and quantification
Proteins were prepared and subjected to LC/MS/MS analysis. Samples were run on a Surveyor HPLC (ThermoFinnigan) with a C18 microbore column (Zorbax 300SBC18, 1 mm × 5 cm). All tryptic peptides (100 μL or 20 μg) were injected onto the column in random order. Peptides were eluted with a linear gradient from 5 to 45 % acetonitrile developed over 120 min at a flow rate of 50 μL/min, and the eluant was introduced into a ThermoFinnigan LTQ linear ion-trap mass spectrometer. The data were collected in the “triple-play” mode (MS scan, Zoom scan, and MS/MS scan).
We searched the OMSSA against the protein isoform database we created to identify peptide. Peptide quantification was carried out using the LC/MS-based label-free protein quantification software licensed from Eli Lilly and Company. Label-free peptide identification and peptide quantitative analysis services were performed by professionals at the Protein Analysis and Research Center/Proteomics Core of Indiana University School of Medicine, co-located at Monarch Life Sciences, Inc, Indianapolis. For a thorough review of the principle and method developed and used, refer to the review by Wang et al [14]. Briefly, once the raw files were acquired from the LTQ, all extracted ion chromatograms (XIC) were aligned by retention time. Each aligned peak should match parent ion, charge state, daughter ions (MS/MS data) and retention time (within a 1-min window). If any of these parameters were not matched, the peak was disqualified from the quantification analysis. After alignment, the area-under-the-curve (AUC) from individually aligned peak was measured, normalized, and compared for their relative abundance using methods described in [15]. All peak intensities were transformed to a log2 scale before quantile normalization. Peptides with intensity lower than preset quality threshold are marked as present; otherwise, as absent.
Statistical identification of isoform markers
Statistical Significance was measured by a three-step method. First, we conducted a Chi-Square Goodness-of-Fit Test to calculate the p value (also called false discovery rate). Then we calculated the FDR adjusted p value. Last, we calculated the FDR q value using the Storey-Tibshirani method [16]. We chose a significance screening filters (q < 0.05) to select peptides of which we estimated significant differences in the health and breast cancer samples. The False Positive Rate (FPR) or expected proportion of false positive among the proteins with declared changes is FPR = qvalue × number of the proteins with declared changes.
Validation
Four validation methods including biological, statistical, Exon Array and pathway validation methods were used to validate our results. Biological validation was carried out with Explainable Index. For gene, we define “Explainable Index” as
$$ \alpha =\frac{\#co{n}_C+1}{\# in{c}_C+1}\cdot \frac{\#co{n}_H+1}{\# in{c}_H+1}, $$
where #con is the number of consistent peptide markers and #inc is the number of inconsistent peptide markers, C be cancer marker set and H be health marker set. If α > 1, we define the gene to be “more explainable”; and if α ≤ 1, we define the gene to be “less-explainable”.
The “consistent” is defined as one of following three conditions:
-
1)
∃i, j, k:E
j
∊ H&E
i
_E
k
∊ C(i < j < k)
-
2)
∃j:E
j
∊ H and ∄ E
i
_E
k
(i < j < k)
-
3)
∃i, k:E
i
_E
k
∊ C(i < k) and ∄ E
j
(i < j < k);
And the “inconsistent” is defined as one of following three conditions:
-
1)
∃i, j, k:E
j
∊ C&E
i
_E
k
∊ H(i < j < k)
-
2)
∃j:E
j
∊ C and ∄ E
i
_E
k
(i < j < k)
-
3)
∃i, k:E
i
_E
k
∊ H(i < k) and ∄ E
j
(i < j < k).
For statistical validation, we used forward feedback neural network to train Study II and then test Study III. We chose each combination of N (N = 5 for five-marker panel or N = 10 for ten-marker panel or N = 26 for twenty-six-marker panel) out of all the 26 differentially expressed isoforms common in both Study II (90) and B (79) as inputs to the FFNN. The training sets are 40 healthy and 40 cancer samples from Study II. The testing sets are independent 40 healthy and 40 cancer samples from Study III.
For a neural network, the output data has to be transformed into binary or numerical data. A two-variable outcome encoding scheme, i.e., healthy = (0,1), cancer = (1,0) was used. In this scheme, it is theoretically possible to have (1,1) or (0,0) as outcomes although extremely rare. For the two variable outcome encoding scheme, we constructed the input layer as N nodes (corresponding to a N-marker panel outcome), the hidden layer as 7 nodes, and the output layer as 2 nodes.
In order to find the optimal classifier, we presented an optimization method that measures the area under the curve (AUC) for Receiver Operating Characteristics (ROC). In this scheme, we first trained neural network for each combination using Study II results. Then, we measured the AUC for each combination using Study III results for testing. Lastly, the optimal combination C
* was determined by
$$ {C}^{*}=\underset{c}{ \arg \min }AUC\left(NE{T}_C,P\right), $$
where AUC is the area under the ROC curve of neural network’s prediction result, NET is the trained neural network, C is combination of picking N out of the 26 isoforms, and P is the testing set of Study III.
The Exon Array for validation was downloaded from GSE19154 in Gene Expression Omnibus. R and BioConductor libraries were used to perform Exon Array analysis. For pathway validation, the 90 alternative splicing biomarkers in Study II and the 79 alternative splicing biomarkers in Study III were used to perform pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/) [17]. Significance level for pathway comparisons was set by hit number >2 due to results of small counts. This allows avoiding any assumptions about the shape of sampling distribution of population.