Genome-wide loss of heterozygosity and copy number alteration in esophageal squamous cell carcinoma using the Affymetrix GeneChip Mapping 10 K array

Background Esophageal squamous cell carcinoma (ESCC) is a common malignancy worldwide. Comprehensive genomic characterization of ESCC will further our understanding of the carcinogenesis process in this disease. Results Genome-wide detection of chromosomal changes, including loss of heterozygosity (LOH) and copy number alterations (CNA), was performed using the Affymetrix GeneChip 10 K single nucleotide polymorphism (SNP) array for 26 pairs of matched germ-line and micro-dissected tumor DNA samples. LOH regions were identified by two methods – using Affymetrix's genotype call software and using Affymetrix's Copy Number Analysis Tool (CNAT) software – and both approaches yielded similar results. Non-random LOH regions were found on 10 chromosomal arms (in decreasing order of frequency: 17p, 9p, 9q, 13q, 17q, 4q, 4p, 3p, 15q, and 5q), including 20 novel LOH regions (10 kb to 4.26 Mb). Fifteen CNA-loss regions (200 kb to 4.3 Mb) and 36 CNA-gain regions (200 kb to 9.3 Mb) were also identified. Conclusion These studies demonstrate that the Affymetrix 10 K SNP chip is a valid platform to integrate analyses of LOH and CNA. The comprehensive knowledge gained from this analysis will enable improved strategies to prevent, diagnose, and treat ESCC.


Background
Genetic instabilities are characteristic of most human cancers. Genome-wide detection of chromosomal changes, including loss of heterozygosity (LOH) and copy number alterations (CNA), either gain or loss, is the focus of substantial attention in cancer research. LOH is frequently observed in a variety of human cancers, and regions with frequent LOH may contain tumor suppressor genes. In addition, LOH may be associated with regions affected by haplo-insufficiency of a group of genes. Thus, detection of LOH will likely remain a cornerstone for predicting tumor aggressiveness for many human tumors [1]. Recently, the discovery of large-scale genome-wide copy number variation has stimulated interest in elucidating the role of CNA in the development of malignancy. The 10 K single nucleotide polymorphism (SNP) array (GeneChip Mapping 10 K array, Affymetrix) offers a high-resolution genomic approach to screen chromosomal alterations systematically. Several studies on allelic imbalance or loss in cancers and cancer cell lines using the 10 K SNP array have been published [2][3][4][5][6][7][8][9][10][11][12].
Esophageal squamous cell carcinoma (ESCC) is a common malignancy worldwide and one of the most common malignancies in the Chinese population. There is great geographic variation in the occurrence of this tumor in China, including exceptionally high-risk areas such as Shanxi Province in north central China where some of the highest esophageal cancer rates in the world occur. The standardized incidence rate for esophageal cancer in Shanxi Province is above 100/100,000 person-years, although it appears that both incidence and mortality rates have begun to decline in the past 10 years [13,14]. Within the high-risk regions in China, there is a strong tendency toward familial aggregation, suggesting that genetic susceptibility, in conjunction with environmental exposures, plays a role in the etiology of ESCC. In the past several years, we have tried to identify susceptibility genes and biomarkers that can be used to screen high-risk populations in north central China for ESCC [15][16][17][18][19][20][21][22]. A previous study examined 366 microsatellite markers in a 10 cM density genome-wide scan in 11 ESCC patients, and identified 14 chromosome arms with high-frequency LOH [15]. However, we were unable to further narrow these LOH regions using microsatellite markers due to their low density. Higher density markers are necessary for positional cloning of tumor suppressor genes in LOH regions.
In the present study we established a high-resolution chromosomal instability profile for ESCC by examining germ-line DNA and matched micro-dissected tumor DNA with a 10 K SNP array to determine both LOH and CNA. We also evaluated whether a pool of normal control samples could be used as the normal referent in an LOH study with the 10 K SNP chip instead of matched germ-line DNA.

LOH by patient and chromosomal arms
In the present study, 26 ESCC patients with blood-derived germ-line DNA and matched micro-dissected tumor DNA were investigated using 10 K SNP arrays. The characteristics of these patients are shown in Table 1. The average signal detection rate was higher in germ-line DNA (99%) than in micro-dissected tumor DNA (79%). Based on NCBI Build 35.1, we summarized characteristics of 11,555 SNPs and mapped them to chromosomes and genes. We first generated a genotyping profile for each patient based on a comparison of the germ-line DNA genotypes to those from the matched micro-dissected tumor DNA. The patients' LOH frequencies, shown in Table 1, ranged from 19% to 95%, and averaged 29%. In four cases (SHE0832, SHE0864, SHE1264, and SHE1490), LOH analysis was also performed using DNA from micro-dissected adjacent normal tissue, in addition to blood-derived germ-line DNA, to see whether the choice of reference affected the results; findings were very similar with the two sources of DNA (Table 1).
The frequencies of LOH on each chromosomal arm are shown in Table 2. Non-random LOH was observed on 10 chromosomal arms, including 17p (76%), 9p (72%), 9q (72%), 13q (68%), 17q (66%), 4q (65%), 4p (60%), 3p (58%), 15q (57%), and 5q (52%). Our previous microsatellite marker-based genome-wide LOH scan in 11 ESCC patients with a positive family history of upper gastrointestinal cancer produced overall LOH frequencies that were somewhat higher than those in the patients evaluated in the present study [15]. We cannot fully explain this between-study variation in LOH frequency, but there are several noteworthy differences between the studies that likely influenced LOH rates, including: (i) heterozygosity is higher for the microsatellite markers than for the SNP markers examined (~75% versus ~30%); (ii) the total number of markers was much higher in the SNP study than in the microsatellite study (11,000 versus 366); and (iii) over twice as many cases were examined in the SNP study (26 versus 11). Some results differed between studies (eg, LOH was ≥ 50% on chromosome 15q in the SNP but not the microsatellite study; LOH was ≥ 50% on 8p, 8q, 11p, 11q and 18p in the microsatellite but not the SNP study). Despite differences in study size, approach, and in some of the results, consistently high LOH frequencies were reported for nine chromosomal arms in both studies (ie, 3p, 4p, 4q, 5q, 9p, 9q, 13q, 17p, 17q). Taken together, the two studies indicate that LOH events on these nine chromosomal arms are the major events associated with genome-wide instability in ESCC in this high-risk Chinese population. These areas are rich in known tumor suppressor genes and oncogenes, including VHL on 3p; NPCA1 on 4p; KIT, GIST, and PDG-

LOH regions
When we used the conservative, traditional approach to LOH in LOH/Model A, we detected 20 LOH regions encompassing a total of 125 SNPs. As shown in Table 3, these 20 LOH regions are located on eight chromosome arms: 13q (four regions), 3p (two regions), 4q (three regions), 9p (two regions), 9q (three regions), 17p (three regions), 17q (two regions), and 4p (one region). The size of these LOH regions ranged from 10 kb to 4.26 Mb (average 1.44 Mb); genes involved in these deletion regions are shown in Table 3. Among the 125 SNPs in these 20 LOH regions, 46 are located in genes (one in a coding exon, 39 in introns, and six in 3'- or 5'-UTRs), and 79 are located in regions flanking genes (ie, within 1 kb). One SNP (rs781852) is located in the coding region of the gene ZZEF1 (Zinc finger, ZZ-type with EF-hand domain 1) on chromosome 17p13.2. Allele A of this SNP encodes the amino acid proline (Pro) and allele B encodes leucine (Leu). Eight of 10 heterozygous cases (Pro/Leu) showed LOH (80%), including five cases that lost allele B and three cases that lost allele A. The 46 SNPs that are located within genes map to 32 genes and include four SNPs in the introns of ZNF618, and two SNPs each in the introns of ITPR1, FLJ14834, LHFP, ITGAE, MYH3 and MYOCD. Some of these 20 deletion regions have been previously identified by our lab and others [17,22]. However, the current study provides far greater precision in locating LOH regions (10 kb to 4.26 Mb, as opposed to 10 cM, which corresponds to 5-10 Mb). As expected, using a less conservative definition for LOH, LOH/Model B detected more regions (and SNPs) than our approach in LOH/Model A: 72 LOH regions containing 2,916 SNPs. The distribution of deletion regions and details from this model are shown in Table 4 and Additional Table 1 (in additional file 1).
Our cLOH data in cLOH/Model A identified only three significant cLOH regions. These included one on 13q12-q13 and two on 13q13, and encompassed a total of 30 SNPs. The sizes of these cLOH regions are 1.9 Mb, 0.4 Mb, and 0.2 Mb, respectively (average 0.83 Mb) (Table 5). The less conservative cLOH/Model B highlighted 64 cLOH regions with 2,128 SNPs; details are shown in Table 6 and Additional

Comparison of LOH and cLOH regions
Our conservative LOH/Model A detected 20 LOH regions including 125 SNPs, but our conservative cLOH/Model A detected only three LOH regions containing 30 SNPs. The detection of only three LOH regions by cLOH/Model A is not unexpected since identifying an LOH region in a sample requires the presence of multiple homozygous SNPs
Table 7 shows 15 regions with CNA losses that were detected at P ≤ 10⁻⁶. These include regions on 1p, 3p, 4q, 5q, 9p, 10p, 11p, 11q, 13q, and 18q. One hundred and two SNPs were mapped within these regions (Table 4A). Details of the involved SNPs and genes are shown in Additional Table 3 (in additional file 1). Table 8 shows the 36 regions where significant CNA gains were identified, including eight on chromosomal arm 3q, seven on 8q, three on 7p, two on 5q, two on 14q, and two on 22q (Additional Table 4 in additional file 1). Examples of whole genome profiles of CNA regions are shown for chromosomes 3, 7, and 8 in Figures 4, 5, 6.

Comparisons between LOH and CNA
We obtain both cLOH and CNA data when we use the pooled normal control sample reference in the CNAT software. Thus we can ask whether cLOH is associated with CNA. Among the 2,128 SNPs identified in our less conservative cLOH/Model B, only 45 (2%) showed CNA loss and just 14 (0.7%) showed CNA gain (Figure 7). This result suggests that CNA accounts for only a small percentage of LOH events in ESCC. LOH in cancers is commonly caused by one of three different mechanisms. The first and most common cause of LOH is mitotic recombination [3]. This mechanism does not change chromosome copy number, and was responsible for 97% of the LOH observed in our study. Deletion, the second cause of LOH, should result in copy number loss, and accounted for approximately two percent of LOH in our study. Finally, LOH can result from amplification of one chromosome, which should show copy  [8]. These differences might reflect genuine differences between tumor types, the lab analytic methods used, or different operative mechanisms at work.
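The percentages quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis; the function name `cloh_cna_breakdown` is hypothetical, and it assumes that the copy-neutral share is simply the remainder of cLOH SNPs showing neither CNA loss nor CNA gain.

```python
def cloh_cna_breakdown(total_snps, loss_snps, gain_snps):
    """Percentage of cLOH SNPs with copy number loss, gain, or no change.

    Assumption: SNPs with neither loss nor gain are treated as copy-neutral.
    """
    neutral_snps = total_snps - loss_snps - gain_snps
    return {
        "loss": 100.0 * loss_snps / total_snps,
        "gain": 100.0 * gain_snps / total_snps,
        "copy_neutral": 100.0 * neutral_snps / total_snps,
    }


# Counts reported in the text for cLOH/Model B
shares = cloh_cna_breakdown(total_snps=2128, loss_snps=45, gain_snps=14)
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

Under that assumption, the remainder (2,069 of 2,128 SNPs) is about 97%, in line with the share attributed above to a copy-neutral mechanism.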
The genome-wide LOH and chromosome copy alteration studies described in this paper can also be applied to higher density SNP chips, such as Affymetrix 100 K and 500 K SNP chips. The increased SNP density will allow even finer mapping of these genetic changes.
In summary, we performed a genome-wide study of LOH and CNA in ESCC patients using the Affymetrix 10 K SNP chip by comparing matched germ-line and tumor DNA. Our approach allowed us to extensively map both LOH and CNAs in ESCC systematically in a manner that has not heretofore been done, and produced numerous regions, genes, and SNPs that merit future exploration. This report is the first comprehensive genome-wide analysis of chromosomal imbalance (LOH and CNA) in ESCC, and the knowledge gained from this analysis will enable the development of improved strategies to prevent, diagnose, and treat ESCC patients in the future.

Conclusion
The Affymetrix 10 K SNP chip is a valid platform to integrate analyses of loss of heterozygosity and copy number alterations. The comprehensive knowledge gained from this analysis will enable improved strategies to prevent, diagnose, and treat esophageal squamous cell carcinoma.

Biologic specimen collection and processing
Ten milliliters of venous blood was taken from each patient prior to surgery and germ-line DNA was extracted and purified using standard methods. Tumor and adjacent normal tissue obtained during surgery were either fixed in ethanol and embedded in paraffin, or snap frozen in liquid nitrogen and stored in a freezer at -80°C until used. Slides were stained with H&E to distinguish tumor from normal epithelium, and tumor cells were micro-dissected under light microscopy using either laser capture micro-dissection (LCM) (for paraffin-embedded samples) or manual dissection (for frozen samples). All micro-dissections were performed by a pathologist (NL) and a trained post-doctoral fellow (HS). Extraction of LCM DNA was previously described [17,23]. Extraction of manually micro-dissected DNA followed the protocol from the Puregene DNA Purification Tissue Kit (Cat Number D-7000A, Gentra Systems, Inc., Minneapolis, MN 55441, USA).
It is well known that using pure tumor DNA obtained by micro-dissection is key to successfully detecting chromosomal changes such as LOH and CNA. However, the 10 K chip requires amplification of DNA fragments up to 1 kb, a particularly challenging task. It is usually difficult to obtain a high yield of DNA from alcohol- or formalin-fixed tissues, especially when using micro-dissection. In our study, the SNP call rates were much lower in micro-dissected tumor DNA from alcohol-fixed tissue than from frozen tissue (data not shown). Although the isolation of tumor DNA from ground tissue using Trizol yielded higher genotype call rates, we think that it is more important to identify LOH and CNA regions than to simply obtain higher genotype call rates. Thus, the best overall genomic characterization results can be expected from the use of micro-dissected frozen tissue.

Affymetrix GeneChip Mapping 10 K array
The 10 K SNP array provides comprehensive coverage of the genome for genotyping studies. Each array contained 11,555 bi-allelic polymorphic SNPs randomly distributed throughout the genome, except for the Y chromosome. The median physical distance between SNPs is approximately 105 kb, and the mean distance between SNPs is 210 kb. The average heterozygosity for these SNPs is 0.37, with an average minor allele frequency of 0.25. The algorithm used to make genotype calls was previously described by Affymetrix [24,25]. DNA samples were assayed according to the protocol (GeneChip Mapping Assay manual) supplied by Affymetrix, Inc. (Santa Clara, CA) as previously described [25,26]. The 10 K SNP arrays were scanned with the Affymetrix GeneChip Scanner 3000 using GeneChip Operating System 1.2 (GCOS) (Affymetrix). Data files were generated automatically. Genotype assignments (ie, calls) were made automatically by GeneChip DNA Analysis Software 3.0 (GDAS) (Affymetrix). The genetic map used in the analysis was obtained from GeneChip Mapping 10 K library files: Mapping 10K_Xba131. "Signal Detection Rate" is the percentage of SNPs that passed the discrimination filter. "Call Rate" is the percentage of SNPs called on the array. Genotype calls are defined as AA, AB, or BB; "no call" means the SNP for that sample did not pass the discrimination filter and was excluded from further evaluation in the present study.
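As an illustration of the "Call Rate" definition given above, the sketch below computes the percentage of SNPs that received a genotype call for one sample. It is not the GDAS implementation; the function name `call_rate` and the `"NoCall"` label for uncalled SNPs are assumptions made for this example.

```python
# Genotype calls as defined in the text; any other label is treated as "no call".
CALLED_GENOTYPES = {"AA", "AB", "BB"}


def call_rate(calls):
    """Percentage of SNPs on one array with a genotype call (AA, AB, or BB)."""
    if not calls:
        raise ValueError("no SNPs supplied")
    n_called = sum(call in CALLED_GENOTYPES for call in calls)
    return 100.0 * n_called / len(calls)


# Toy sample with four SNPs, one of them uncalled ("NoCall" is a hypothetical label)
example_calls = ["AA", "AB", "NoCall", "BB"]
print(call_rate(example_calls))
```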

Data analysis
Since patient-matched normal DNA is not always available as a reference for high-resolution allelotyping, we evaluated LOH using two different methods: first, we used patient-matched normal DNA as the reference (the traditional approach); and second, we assessed whether it was possible to instead use a pool of normal control samples as the reference, as is done with the chromosome Copy Number Analysis Tool 2.0 software (CNAT) from Affymetrix.
In the first method, LOH was defined in a traditional manner as a change in genotyping call from heterozygosity (AB) in the germ-line DNA to homozygosity (AA or BB) in the matched micro-dissected tumor DNA (all calls from GDAS). In the second method, LOH was also defined as a change in genotyping from "normal" to tumor; however, "normal" here was defined based on data already present in the Affymetrix CNAT software from prior testing of 100 ethnically-diverse normal reference subjects [27]. LOH in the second method was based on a comparison of a track of contiguous SNPs in tumor to the analogous track of contiguous SNPs in the "normal" population DNA. Since the "normal" DNA here includes not just one but 100 individuals, the state of these SNPs (ie, whether they are heterozygous or homozygous) was inferred statistically as a likelihood estimation with confidence calculated from a binomial distribution of the observed state of these SNPs in this normal population. A contiguous run of homozygous SNPs in tumor where these SNPs are heterozygous in the "control" suggests LOH in the region spanning the SNPs. Hence, no germ-line DNA data from cases was used for this second analysis. We refer to this LOH as "cLOH" to distinguish it from our more traditional analysis approach using paired normal and tumor samples and to indicate that it utilized a common control pool of normal DNA generated by CNAT. The threshold for statistical significance used in CNAT was P ≤ 10⁻⁶ as recommended by Affymetrix [27].
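The first (paired) definition can be sketched in a few lines. This is an illustrative sketch rather than the GDAS workflow: the function names, the `"NoCall"` label, and the choice to express a patient's LOH frequency as LOH SNPs divided by informative (germ-line heterozygous, tumor-called) SNPs are assumptions made for the example.

```python
from typing import List, Optional

HETEROZYGOUS = "AB"
HOMOZYGOUS = {"AA", "BB"}


def loh_status(germline: str, tumor: str) -> Optional[bool]:
    """True if the SNP shows LOH, False if heterozygosity is retained,
    None if the SNP is uninformative (germ-line not AB, or tumor not called)."""
    if germline != HETEROZYGOUS:
        return None
    if tumor in HOMOZYGOUS:
        return True
    if tumor == HETEROZYGOUS:
        return False
    return None  # e.g. a "NoCall" in tumor (hypothetical label)


def loh_frequency(germline_calls: List[str], tumor_calls: List[str]) -> Optional[float]:
    """Fraction of informative SNPs that show LOH (assumed definition)."""
    statuses = [loh_status(g, t) for g, t in zip(germline_calls, tumor_calls)]
    informative = [s for s in statuses if s is not None]
    if not informative:
        return None
    return sum(informative) / len(informative)


# Toy example: three informative SNPs, two of which lose heterozygosity
germline = ["AB", "AB", "AA", "AB", "BB", "AB"]
tumor = ["AA", "AB", "AA", "BB", "BB", "NoCall"]
print(loh_frequency(germline, tumor))
```

Only SNPs that are heterozygous in germ-line DNA can reveal LOH, which is why homozygous germ-line SNPs are treated as uninformative here.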
We combined the LOH results from a cluster of SNPs in a genetic locus to define a deletion region. We defined these deletion regions in several ways to permit comparison with the existing scientific literature as well as to make comparisons within our own study using different refer-
The second approach we took (labeled "LOH/Model B") was very similar to LOH/Model A in that we used the same constraints on the SNPs noted above, but we were less conservative in our requirement for the percent of the deletion region which showed LOH: only ≥ 50% LOH frequency (instead of ≥ 75%) among the SNPs was required to be classified as a deletion region. To enable comparability with data from the cLOH approach
We also developed two models using exclusively data from the cLOH approach with the CNAT-generated pooled controls. The first used the conservative definition described above for LOH/Model A of a ≥ 75% LOH requirement to declare a deletion region, and also treated homozygous SNPs in deletion regions in accord with the CNAT algorithm described above; we termed this "cLOH/Model A". Although the level of LOH required is the same for LOH/Model A and cLOH/Model A, direct comparisons between them are not possible because of the different algorithms used to treat uninformative SNPs in deletion regions. The second approach loosened the LOH requirement to ≥ 50% to declare a deletion region (as with LOH/Model B above), also used the CNAT algorithm for homozygous SNP calls in deletion regions, and was termed "cLOH/Model B".
Individual SNP copy numbers and chromosomal regions with gains or losses were also determined by evaluation with CNAT based on the SNP hybridization signal intensity data from the experimental sample relative to intensity distributions derived from the previously described reference set containing over 100 normal individuals [27]. P-values were log10-transformed and plotted along the corresponding chromosome; values were considered significant at P ≤ 10⁻⁶.
We further defined CNA-gain regions as regions where five or more contiguous SNPs showed copy number gain in at least 50% of cases, and the P-value for the difference from the reference was ≤ 10⁻⁶. Similarly, CNA-loss regions were defined as regions where five or more contiguous SNPs showed copy number loss in at least 50% of cases, and the P-value for the difference from the reference was ≤ 10⁻⁶.
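The region rule just described (five or more contiguous SNPs altered in at least 50% of cases) can be sketched as a simple run-finding pass. This is a sketch under stated assumptions, not the CNAT algorithm: the function name `cna_regions` is hypothetical, SNPs are assumed to be ordered by chromosomal position, and each per-case boolean is assumed to already mean "gain (or loss) at this SNP with P ≤ 10⁻⁶ in this case".

```python
from typing import List, Tuple


def cna_regions(calls: List[List[bool]], min_run: int = 5,
                min_frac: float = 0.5) -> List[Tuple[int, int]]:
    """Find runs of contiguous SNPs altered in at least `min_frac` of cases.

    calls[c][i] is True if case c shows a significant alteration at SNP i.
    Returns inclusive (start, end) SNP index pairs for runs of >= min_run SNPs.
    """
    n_cases = len(calls)
    n_snps = len(calls[0]) if calls else 0
    # Fraction of cases altered at each SNP
    frac = [sum(case[i] for case in calls) / n_cases for i in range(n_snps)]

    regions, start = [], None
    for i in range(n_snps + 1):  # one extra step closes a run at the array end
        qualifies = i < n_snps and frac[i] >= min_frac
        if qualifies and start is None:
            start = i
        elif not qualifies and start is not None:
            if i - start >= min_run:
                regions.append((start, i - 1))
            start = None
    return regions


# Toy data: 4 cases x 8 SNPs; SNPs 0-5 are altered in at least half of the cases
toy_calls = [
    [True, True, True, True, True, True, False, False],
    [True, True, True, True, True, False, False, True],
    [False, True, True, True, True, True, False, True],
    [False, False, False, False, False, False, False, True],
]
print(cna_regions(toy_calls))
```

The same function serves for gains and losses; only the meaning of the input booleans changes. The isolated SNP at index 7 is altered in three of four cases but is not reported, because the run is shorter than five SNPs.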

Abbreviations
SNP, single nucleotide polymorphism; ESCC, esophageal squamous cell carcinoma; LOH, loss of heterozygosity; CNA, copy number alteration; LCM, laser-capture microdissection; CNAT, copy number analysis tool; GCOS, GeneChip Operating System; GDAS, GeneChip DNA Analysis Software; cLOH, LOH determined by comparing patient tumor DNA with common control pool of DNA; NCBI, National Center for Biotechnology Information

Authors' contributions
NH conceived of the study, oversaw laboratory and statistical analysis, and drafted the manuscript. CW, NL, and HS performed the laboratory analyses and conducted initial statistical analyses. YH and HHY directed final statistical analyses. L-HK and Q-HW oversaw the data and sample collection procedures for the study. AMG conceived of the study, participated in statistical analyses, and revised the manuscript. KHB provided general statistical and scientific guidance for the study. ME-B conceived of the study, aided in interpretation of the data, and revised the manuscript. PRT conceived of the study, oversaw laboratory and statistical analyses, revised the manuscript, and obtained funding. MPL conceived of the study, oversaw statistical analyses, and revised the manuscript. All authors read and approved the final manuscript.
Figure 7 Comparison of SNPs with cLOH (from our less conservative "cLOH/Model B") and CNAs (from CNAT software using pooled DNA from normal controls).