A new analysis tool for individual-level allele frequency for genomic studies

Background Allele frequency is one of the most important population indices and has been broadly applied to genetic/genomic studies. Estimation of allele frequency using genotypes is convenient but may lose data information and be sensitive to genotyping errors. Results This study utilizes a unified intensity-measuring approach to estimating individual-level allele frequencies for 1,104 and 1,270 samples genotyped with the single-nucleotide-polymorphism arrays of the Affymetrix Human Mapping 100K and 500K Sets, respectively. Allele frequencies of all samples are estimated and adjusted by coefficients of preferential amplification/hybridization (CPA), and large ethnicity-specific and cross-ethnicity databases of CPA and allele frequency are established. The results show that using the CPA significantly improves the accuracy of allele frequency estimates; moreover, this paramount factor is insensitive to the time of data acquisition, effect of laboratory site, type of gene chip, and phenotypic status. Based on accurate allele frequency estimates, analytic methods based on individual-level allele frequencies are developed and successfully applied to discover genomic patterns of allele frequencies, detect chromosomal abnormalities, classify sample groups, identify outlier samples, and estimate the purity of tumor samples. The methods are packaged into a new analysis tool, ALOHA (Allele-frequency/Loss-of-heterozygosity/Allele-imbalance). Conclusions This is the first time that these important genetic/genomic applications have been simultaneously conducted by the analyses of individual-level allele frequencies estimated by a unified intensity-measuring approach. We expect that additional practical applications for allele frequency analysis will be found. The developed databases and tools provide useful resources for human genome analysis via high-throughput single-nucleotide-polymorphism arrays. 
The ALOHA software was written in R and R GUI and can be downloaded at http://www.stat.sinica.edu.tw/hsinchou/genetics/aloha/ALOHA.htm.


Background
Allele frequency denotes the relative frequency of an allele compared with the total frequency of all alleles at a marker locus. It is one of the most important population indices and has been broadly applied to genetic/genomic research [1][2][3][4][5]. The generalized concept of allele frequency has two aspects: individual-level allele frequency and population-level allele frequency. The former represents a within-individual relative frequency of alleles and its standard error reflects inter-cell variability in an individual; the latter represents a within-population relative frequency of alleles and its standard error reflects inter-individual variability in a population [6].
Using the most abundant genetic marker in the human genome, namely the single-nucleotide polymorphism (SNP), both individual-level and population-level allele frequencies can be estimated using a (genotype-based) allele-counting approach and an (intensity-based) intensity-measuring approach (Appendix A). This paper focuses on the intensity-measuring allele frequency because this estimate is insensitive to genotyping errors and preserves data information that might be lost in conventional genotype-based analyses. For instance, the genomic abnormality of a particular sample, NA18996 from the Japanese population in the International HapMap Project [7][8][9][10], can be easily observed by individual-level allele frequencies using an intensity-measuring approach [Additional file 1, Supplemental Figure S1 (A)] but not by using an allele-counting approach [Additional file 1, Supplemental Figure S1 (B)]. This finding motivated us to analyze intensity-based allele frequencies to extract information lost in conventional genotype-based analyses.
An intensity-measuring approach can accurately estimate allele frequencies with the aid of an adjustment for preferential amplification/hybridization [11]. The coefficient of preferential amplification/hybridization (CPA) is used to quantify preferential amplification/hybridization and reduce the estimation bias of allele frequency [12][13][14][15][16][17][18][19]. Consider a SNP with genotype AA, Aa or aa.  Figure S3 (B)] [6]. The results demonstrate the important role of CPA adjustment in allele frequency estimation, thereby expanding applications of allele frequency in various genomic science disciplines. The features of CPA have been studied, but not exhaustively [14,15,17]. Therefore, we undertook an in-depth investigation of the relationship between CPA and important factors including sample size, time of data acquisition, effects of laboratory site, type of gene chip, ethnicity effects, and phenotypic status, and we constructed new public CPA and allele frequency databases.
Population-level allele frequency has many important applications and has been well discussed [1][2][3][4]. This paper mainly focuses on individual-level allele frequency and investigates its applications in genetic/genomic studies, including the discovery of allele frequency patterns, identification of chromosomal aberrations (including aneuploidy, loss of heterozygosity (LOH), and allelic imbalance (AI)), outlier detection, and sample classification (including stratification by population and gender). This is the first time that these important genetic/genomic applications have been simultaneously conducted based on individual-level allele frequencies estimated by a unified intensity-measuring approach.

Samples
This study analyzes genotype and intensity data from several large genomic projects. The first dataset consists of 367 and 448 Taiwanese samples genotyped using the Affymetrix Human Mapping 100K Set and 500K Set (Affymetrix, San Diego, CA, USA), respectively, from the Taiwan Han Chinese Cell and Genome Bank [20]. The second dataset consists of 175 and 198 hypertension patients genotyped using the Affymetrix Human Mapping 100K Set and 500K Set, respectively, from the Academia Sinica Multi-Centered Hypertension Genetic Study. The third dataset consists of 30 African trios of Yoruba in Ibadan (YRI), 30 Caucasian trios of residents in CEPH Utah (CEU), and 90 Asians (45 Han Chinese in Beijing (CHB) and 45 Japanese in Tokyo (JPT)) from the International HapMap Project [7][8][9][10], where all 270 samples were genotyped using both the Affymetrix Human Mapping 100K Set and 500K Set. The fourth dataset consists of 242 and 304 leukaemia cancer patients who were genotyped using the Affymetrix Human Mapping 100K Set and 500K Set, respectively; in addition, 50 normal controls were genotyped with the Affymetrix Human Mapping 100K and 500K Sets from an acute lymphoblastic leukaemia (ALL) project [21,22]. All participants involved in the genomic projects signed Informed Consent Forms.

Genotyping
All samples were genotyped with the Affymetrix Human Mapping 100K Set and/or 500K Set (Affymetrix, San Diego, CA, USA), which contain 116,204 SNPs with a median inter-marker distance of 8.5 kb and 500,568 SNPs with a median inter-marker distance of 2.5 kb, respectively. The data sheets and genotyping manuals for the two SNP chips can be downloaded at http://www.affymetrix.com/. For details of the SNP genotyping experiments, refer to Yang et al. [23] for the Taiwanese normal samples and hypertensive patients, The International HapMap Consortium [7][8][9][10] for the International HapMap Project samples, and Mullighan et al. [21,22] for the ALL Project samples. SNP genotype calling algorithms, DM (Dynamic Model) [24] and BRLMM (Bayesian Robust Linear Model with Mahalanobis Distance Classifier) [25], were used for the Affymetrix Human Mapping 100K Set and 500K Set, respectively.

Allele frequency estimation
An individual-level allele frequency is the proportion of a specific allele in a genotype. Here we formulate the existing procedures of allele frequency estimation. Consider a SNP with genotype aa, Aa or AA. Two methods can be used to estimate an individual-level allele frequency. First, the allele-counting approach uses genotype data from an individual genotyping experiment (IGE). The number of copies of allele A is counted and then used to calculate the proportion of allele A in a genotype. Apart from no calls, there are only three possible outcomes of an allele frequency estimate for a SNP: the estimates are 0, 0.5 and 1, corresponding to genotypes aa, Aa and AA, respectively (Equation (A1)). Second, the intensity-measuring approach uses intensity data from an IGE. The frequency of allele A is estimated by calculating the ratio of the intensity pertaining to allele A relative to the total intensity of the two alleles, where intensities are adjusted by considering a CPA. The ratio reflects the relative amount of allele A compared with the total amount of the two alleles at a SNP for an individual (Equation (A2)).
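The contrast between the two individual-level estimators can be illustrated with a short Python sketch. The function names and intensity values are hypothetical, and the form of the CPA adjustment used here (CPA taken as the ratio of the mean allele-A intensity to the mean allele-a intensity among heterozygotes, then applied to the allele-a intensity) is an assumption made for illustration; Equations (A1) and (A2) remain the authoritative definitions.

```python
from statistics import mean

# Allele-counting approach: proportion of allele A in a called genotype.
GENOTYPE_FREQ = {"aa": 0.0, "Aa": 0.5, "AA": 1.0}

def allele_counting_freq(genotype):
    """Return 0, 0.5 or 1 for aa, Aa, AA; None for a no-call."""
    return GENOTYPE_FREQ.get(genotype)

def estimate_cpa(het_intensities):
    """Assumed CPA form: mean A intensity / mean a intensity over heterozygotes."""
    s1 = mean(s[0] for s in het_intensities)
    s2 = mean(s[1] for s in het_intensities)
    return s1 / s2

def intensity_freq(s1, s2, cpa):
    """Assumed adjusted ratio: the allele-a intensity is rescaled by the CPA."""
    return s1 / (s1 + cpa * s2)

# Hypothetical (allele A, allele a) intensities for heterozygous reference samples.
hets = [(1200.0, 800.0), (1500.0, 1000.0), (900.0, 600.0)]
cpa = estimate_cpa(hets)                        # allele A is preferentially amplified
raw = 1200.0 / (1200.0 + 800.0)                 # unadjusted ratio for a heterozygote
adjusted = intensity_freq(1200.0, 800.0, cpa)   # ratio after CPA adjustment
```

In this toy case the unadjusted ratio for a heterozygote is 0.6, whereas the CPA-adjusted ratio returns to the expected 0.5, which is the kind of bias reduction the adjustment is meant to achieve.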
In contrast to an individual-level allele frequency, a population-level allele frequency is the proportion of a specific allele in a study population. The allele frequencies can be estimated based on data from an IGE or a pooled allelotyping experiment (PAE). In an IGE, individual-level allele frequencies are estimated by using genotype data or intensity data as mentioned above. Then a population-level allele frequency is estimated by taking an average over individual-level allele frequencies from genotype data (Equation (A3)) or intensity data (Equation (A4)). In a PAE, intensity data are available, but individual genotype data are not. A relative intensity of allele A in a DNA pool, which is constructed by mixing genomic DNA from multiple samples, is calculated to estimate a population-level allele frequency (Equation (A5)). The detailed procedures for allele frequency estimation and CPA adjustment are described in Appendix A.
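As a minimal sketch of the two routes to a population-level estimate described above, the following Python fragment averages individual-level estimates (IGE) and computes a CPA-adjusted relative intensity for a pool (PAE). The function names, the data, and the adjusted-ratio form are illustrative assumptions rather than the exact estimators of Equations (A3) to (A5).

```python
def population_freq_from_individuals(individual_freqs):
    """IGE: average the individual-level frequencies, skipping no-calls (None)."""
    called = [f for f in individual_freqs if f is not None]
    return sum(called) / len(called)

def population_freq_from_pool(s1, s2, cpa):
    """PAE: CPA-adjusted relative intensity of allele A in the pool (assumed form)."""
    return s1 / (s1 + cpa * s2)

# Hypothetical allele-counting estimates for five individuals (one no-call).
genotype_based = [1.0, 0.5, 0.5, 0.0, None]
p_ige = population_freq_from_individuals(genotype_based)

# Hypothetical pooled intensities for the same SNP and an assumed CPA of 1.5.
p_pae = population_freq_from_pool(s1=3000.0, s2=2000.0, cpa=1.5)
```

Both routes give 0.5 for these made-up inputs; with real data the agreement between them is what the comparison of pooled and individual experiments assesses.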

Identification of chromosomal aberrations
We develop multiple chromosomal aberration indices and a sliding-window approach in concert with a standard individual-level allele frequency plot to identify chromosomal aberrations such as aneuploidy, AI, LOH, and long contiguous stretch of homozygosity (LCSH). For an individual and a SNP, the individual-level allele frequency is compared with genotype-specific reference confidence intervals. Index $A_{im}$ is used to detect AI genomic segments characterized by SNP points with allele frequencies outside the allele frequency confidence intervals of the three genotypes. Smoothed versions of the indices are also calculated using a spline smoothing technique. We identify chromosomal aberrations by pinpointing the genomic regions where the indices of a patient are higher than the 95%-quantile indices of normal controls. Detailed procedures for identifying chromosomal aberrations are described in Appendix B.
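The general idea of a sliding-window index compared against a control quantile can be sketched in Python as follows. This is only an illustration under stated assumptions: the index here is simply the fraction of SNPs in a window whose frequency lies outside all genotype-specific reference intervals, the interval bounds and data are invented, a single genome-wide threshold replaces position-specific control quantiles, and no spline smoothing is applied. The actual indices are defined in Appendix B.

```python
def outside_reference(freq, intervals):
    """True if freq falls in none of the genotype-specific reference intervals."""
    return not any(lo <= freq <= hi for lo, hi in intervals)

def sliding_window_index(freqs, intervals, window):
    """Fraction of SNPs outside all reference intervals in each window."""
    flags = [outside_reference(f, intervals) for f in freqs]
    return [sum(flags[i:i + window]) / window
            for i in range(len(flags) - window + 1)]

def quantile(values, q):
    """Simple empirical quantile: the sorted value at the rounded position."""
    ordered = sorted(values)
    pos = int(round(q * (len(ordered) - 1)))
    return ordered[max(0, min(len(ordered) - 1, pos))]

# Hypothetical reference intervals for genotypes aa, Aa and AA.
intervals = [(0.0, 0.1), (0.4, 0.6), (0.9, 1.0)]

# Hypothetical patient frequencies along a chromosome; the middle SNPs deviate.
patient = [0.02, 0.5, 0.98, 0.3, 0.7, 0.25, 0.75, 0.5, 0.01]
patient_index = sliding_window_index(patient, intervals, window=3)

# Hypothetical window indices observed in normal controls.
control_index = [0.0] * 18 + [1 / 3, 1 / 3]
threshold = quantile(control_index, 0.95)

# Windows (by starting SNP position) whose index exceeds the control threshold.
aberrant = [i for i, v in enumerate(patient_index) if v > threshold]
```

Consecutive flagged windows delimit a candidate aberrant segment, whose start and end can then be read off from the SNP positions.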

Sample classification and outlier detection
We apply an allele frequency biplot based on individual-level allele frequencies to classify samples and detect outliers. An allele frequency biplot, which uses a singular value decomposition to decompose an allele frequency matrix into a sample matrix and a SNP matrix (Equations (C1) and (C2)), projects samples and SNPs onto a two-dimensional plane simultaneously. The first dimension is a sample coordinate constructed by the first two columns of the sample matrix, and the second dimension is a SNP coordinate constructed by the first two columns of the SNP matrix. The rank-2 biplot configuration can be utilized for sample classification, outlier detection and SNP clustering. First, sample classification allows for samples with a similar allele frequency distribution to be clustered and used to study population stratification and gender grouping. Second, outlier detection identifies samples far away from the majority of samples for a further examination prior to downstream analyses. Third, SNP clustering identifies a collection of SNPs physically close to specific sample groups. The SNP patterns are used to separate sample groups and explain sample characteristics. Detailed procedures for constructing an allele frequency biplot are described in Appendix C.

Coefficient of preferential amplification/hybridization
We characterize genomic patterns of CPA based on three genome projects: the Taiwan Han Chinese Cell and Genome Bank [20], the Academia Sinica Multi-Center Hypertension Genetic Study, and the International HapMap Project [7][8][9][10]. Both Affymetrix 100K and 500K Sets are considered. This work extends our previous study by using richer study samples and denser SNP chips [17]. We calculate CPA based on an unbiased estimator [12] and construct public CPA databases for the Taiwanese population, ethnic-specific populations (African, Asian, and Caucasian populations), and a cross-ethnicity group (a combination of African, Asian and Caucasian populations). The CPA databases are available online at http://140.109.72.48/index.htm. Three CPA query methods (keyword query, general query and advanced query) are provided. In addition, we examine the relationship between CPA and important factors including: (1) sample size (45, 90, 180, and 367 individuals); (2) time of data acquisition (four genotyping periods) and effects of laboratory site (two sites); (3) gene chip type (Affymetrix Human Mapping 100K Set and 500K Set); (4) ethnicity effects (Africans, Asians, and Caucasians); and (5) phenotypic status (normal controls and young-onset hypertension patients). The results are summarized in the order in which these issues are listed here.
First, CPA can be estimated well for a moderate sample size [Additional file 4, Supplemental Figure S4 (A)], but its variance decreases significantly with increasing sample size [Additional file 4, Supplemental Figure S4 (B)]. Based on a total of 367 Taiwanese samples that were genotyped with the Affymetrix Human Mapping 100K Set, 45, 90 and 180 samples are randomly selected from the 367 samples. Pairwise correlation coefficients of log2(CPA) obtained from different sample sizes are calculated, and the lowest correlation coefficient is greater than 0.975. In the fitted quadratic regression, the intercept term is close to 0 (i.e., the regression curve passes through the origin), and the regression coefficients of the linear term and quadratic term are close to 1 and 0, respectively. In other words, CPAs are close even when they are calculated from a relatively small sample size, implying that CPA can be estimated well even for a moderate sample size [Additional file 4, Supplemental Figure S4  Finally, CPA distributions between hypertension patients and normal controls are similar [Additional file 9, Supplemental Figure S9]. CPAs are calculated based on a total of 180 Taiwanese normal control samples and 175 Taiwanese hypertension patient samples that were genotyped with the Affymetrix Human Mapping 100K Set in the same laboratory. A high correlation (r = 0.980) is observed. This result implies that the pooled DNA association mapping of a Taiwanese hypertension study may use the same CPA for case and control groups.
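The sample-size comparison described above (correlating log2(CPA) estimated from a subsample with log2(CPA) estimated from all samples) can be mimicked with a small simulation. Everything in this Python sketch is hypothetical: the intensities are simulated rather than taken from the Taiwanese data, every simulated sample is treated as heterozygous, and CPA is computed as a plain ratio of mean intensities rather than with the unbiased estimator of [12].

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log2_cpa(samples):
    """log2 of (mean allele-A intensity / mean allele-a intensity)."""
    s1 = sum(s[0] for s in samples) / len(samples)
    s2 = sum(s[1] for s in samples) / len(samples)
    return math.log2(s1 / s2)

rng = random.Random(0)
full, subset = [], []
for _ in range(200):                        # 200 simulated SNPs
    true_cpa = 2 ** rng.gauss(0.0, 0.5)     # SNP-specific preferential amplification
    hets = []
    for _ in range(367):                    # simulated heterozygous samples
        s2 = rng.uniform(500.0, 1500.0)
        s1 = true_cpa * s2 * math.exp(rng.gauss(0.0, 0.1))
        hets.append((s1, s2))
    full.append(log2_cpa(hets))                     # estimate from all 367
    subset.append(log2_cpa(rng.sample(hets, 45)))   # estimate from 45 at random

r = pearson(full, subset)  # near 1 when CPA is stable across sample sizes
```

A correlation close to 1 in such a comparison indicates that the subsample already recovers the SNP-specific CPA pattern, while the residual scatter shrinks as the subsample grows.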

Allele frequency
Based on our CPA databases, we estimate individual-level allele frequencies using the intensity-measuring approach for samples from different populations, including (1) -2007), with the exception of some noisy SNP points, an allele frequency plot exhibits three bands formed by SNPs with an individual-level allele frequency close to 0, 0.5 or 1 (Figure 1 and Equation (A2)). Allele frequency plots of cancer patients with chromosomal aberrations will be shown later in Identification of Chromosomal Aberrations. In addition, we establish individual-level allele frequency databases that provide SNP annotation, allele frequencies and the summary statistics for the HapMap samples for both the Affymetrix Human Mapping 100K and 500K Sets. The databases are available online at http://140.109.72.48/index.htm.
Three allele frequency query methods (keyword query, general query and advanced query) are provided. The individual-level allele frequency estimates and databases are applied to genomic pattern detection, chromosomal aberration analysis, classification analysis, and outlier detection.

Identification of chromosomal aberrations
Individual-level allele frequency can be used to identify chromosomal aberrations, such as aneuploidy, LOH, and AI, which contribute to the complex genomic profiles of many cancer patients. Based on our constructed databases of CPA and individual-level allele frequency, we apply the developed sliding-window LOH and AI detectors to identify chromosomal aberrations for the 242 and 304 ALL patients that were genotyped with the Affymetrix Human Mapping 100K Set and 500K Set, respectively, in the ALL Project. Two examples are given here to demonstrate whole-chromosome aberration and segment aberration. The first example is a male leukaemia patient with hyperdiploidy (Hyper-dip50-SNP-#27). His genome exhibits AI on chromosomes 6, 8, 10, 14, 17, and 18 and LOH on chromosomes 2, 19 and 22 (Figure 2 and Appendix B). The second example is a male T-cell ALL patient (T-ALL-SNP-#49). Three chromosomal regions of AI, 2p, 6q, and 9p, are identified, with 6q and 9p also showing LOH (Figure 3 and Appendix B).

Group classification and outlier detection
Based on individual-level allele frequencies, an allele frequency biplot is developed to classify samples belonging to different ethnic groups and gender groups and identify outlier samples.
First, in an unsupervised cluster analysis of the International HapMap Project data, the majority of samples from African (YRI), Caucasian (CEU) and Asian (CHB and JPT) populations are clearly classified into three groups in an allele frequency biplot (Figure 4 and Appendix C). Furthermore, samples from Chinese (CHB) and Japanese (JPT) populations are also accurately classified as two distinct groups (Figure 5 and Appendix C). The results illustrate that individual-level allele frequencies are not only useful for distinguishing the genetic differences among the Asian, Caucasian and African populations, but are also able to distinguish the small differences between Asian subpopulations.
Second, allele frequencies can be used to judge the gender of the person from whom a sample was obtained. Genetically, a female has two X chromosomes and a male has only one X chromosome. High homozygosity on the X chromosome should generally be observed in a male sample [Additional file 2, Supplemental Figure S2 (B) for a male and Figure 1 for a female]. Our analysis shows that an allele frequency biplot for the X chromosome can classify the gender of sample donors [Additional file 10, Supplemental Figure S10].
Third, allele frequency biplots can be used to detect outliers. For example, a JPT sample (NA1902) is some distance away from all other samples in a biplot of chromosome 2 (Figure 5) because this sample has an LCSH on this chromosome.

Discussion
We apply individual-level allele frequencies, estimated using an intensity-measuring approach, to genetic and genomic studies. Under this framework, statistical inferences are drawn based on intensity data instead of genotype data. Compared with a genotype-based analysis that estimates allele frequencies via an allele-counting approach, our method has three advantages. First, the method can analyze data from DNA samples reflecting clonal heterogeneity and containing DNA from contaminating "normal" cells, which is often the case in cancer studies. Second, the method can be used to analyze data from mixed DNA of multiple samples for cost savings. Third, the method is insensitive to genotype calling errors, which frequently occur in aberrant chromosomal regions, such as segments with aneuploidy, LOH, or AI. The performance of the methods developed for amplification/hybridization calibration, allele frequency estimation, group classification, outlier detection, and genomic aberration identification is examined comprehensively based on the data from large-scale genomic projects.
The proposed methods are not restricted to the Affymetrix GeneChip, but can also be adapted to the Illumina BeadChip. For example, the individual-level allele frequency plot of a normal sample genotyped with the Illumina HumanHap550-Duo BeadChip also yields the expected three-band pattern similar to what is obtained with the Affymetrix GeneChip [Additional file 11, Supplemental Figure S11]. In addition to applications for individual-level allele frequency, the proposed intensity-measuring approach can be easily applied to study population-level allele frequency. For example, in a pooled allelotyping experiment [2][26][27][28][29][30][31], individual genotype information is not available, so the allele-counting approach cannot be applied; the proposed intensity-measuring approach, however, works well in this type of experiment (Appendix A). Allele frequency estimates from the pooled allelotyping experiment with a pool size of 240 are quite accurate compared with the reference values obtained from the individual genotyping experiment of all 240 samples (r = 0.983) [Additional file 3, Supplemental Figure S3 (B)]. Pooled DNA is constructed by mixing DNA from multiple individuals; therefore, allelotyping of pooled DNA provides a cost-saving alternative to an individual genotyping experiment and has been recognized as an efficient tool for identification of polymorphisms and mutations, homozygosity mapping, and genetic association.
We have constructed large ethnicity-specific and cross-ethnicity CPA databases. Compared with our previous database [17], the new databases are improved by increased sample size and the addition of a dense SNP chip platform. Researchers may use our free database, thereby curtailing study costs. The characteristics of CPA extracted from our databases are summarized as follows. First, CPA variance decreases significantly as sample size increases; thus, a CPA database should be based on a large number of samples. Second, CPA is independent of the time and laboratory where the genotyping took place, and therefore the CPAs provided in our databases are applicable to studies carried out at different times and places. Third, the 100K and 500K gene chips yield similar CPA patterns. Fourth, discrepancies of CPAs among ethnic groups discourage the direct use of CPA from a different ethnic group. Fifth, for most SNPs, CPAs are strongly positively correlated between hypertension patients and normal controls. Sixth, log-normal distributions capture the genomic distributions of CPA well, and CPA is associated with GC content and genotype. These findings broaden our understanding of preferential amplification/hybridization and will benefit a wide variety of genetic/genomic studies. We also constructed a database of individual-level allele frequency, which provides useful materials for the genetic/genomic research discussed in this paper.
Interestingly, when compared to the result of DNA copy number changes in T-ALL-SNP-#49 reported by Mullighan et al. [21], AI of chromosome 2p correlates well with amplification, whereas AI and LOH of chromosomes 6q and 9p are the result of deletions. These results demonstrate the feasibility of applying individual-level allele frequencies to identify chromosomal aberrations. Our method can be jointly used with some of the existing copy number analysis tools such as LB [32], PennCNV [33], QuantiSNP [34], dCHIP [35], and GenoCN [36] to move toward a more complete understanding of the mechanisms underlying chromosomal abnormalities.
In addition to identification of chromosomal aberrations, individual-level allele frequency can also be used to identify an occurrence of cell mixture. JPT sample NA18996 is an example with "abnormal" genomic patterns [Additional file 1, Supplemental Figure S1 (A)] that obviously deviate from the three-band pattern of a normal sample [Additional file 2, Supplemental Figure S2 (B)]. The individual-level allele frequency plot of sample NA18996 shows that the majority of this individual's genome has five allele frequency bands, with the exception that some autosomal regions have two or three bands and the X chromosome has four bands, indicating that this sample may reflect a mixture of DNA samples. In addition, we show that individual-level allele frequency can also be used to estimate the level of contaminating "normal" cells (Appendix D). In the example of T-ALL-SNP-#49, the average allele frequencies ($f_1$) of the abnormal regions of 6q and 9p were $f_1$ = 0.192 for 6q and $f_1$ = 0.190 for 9p, implying an estimated range of contaminating "normal" cells of 0.235 to 0.250, which is close to the sample purity (blast percentage = 76%) reported by Mullighan et al. [21].
Our analyses show that individual-level allele frequencies and allele frequency biplots are useful for distinguishing genetic differences among ethnic populations. On the one hand, the method can be applied to examine population stratification, which is one of the most common factors underlying false positives in case-control association studies, through the identification of subgroups. On the other hand, the method is potentially useful to assign ethnicity for individuals with unclear ancestry information by superimposing a new sample point to the established ethnic groups using an allele frequency biplot, or to assign unknown forensic individuals to pre-study groups. Compared with a principal component analysis that can also be applied to study sample classification, an allele frequency biplot provides a more detailed visual appraisal of the relationship between SNPs and samples (Appendix C).

Conclusion
Allele frequency has been studied for many years. With the aid of whole-genome SNP chip technology, high-throughput SNP data have become available. Based on the individual-level allele frequencies obtained from the unified estimation procedure, we developed new methods/software/databases to extract hidden genetic information and broaden the potential application of allele frequency to genomic research. Our new analysis tool, ALOHA (Allele-frequency/Loss-of-heterozygosity/Allele-imbalance), written in R and R GUI, provides genome-wide analysis of allele frequency and detection of both LOH and AI. An allele frequency biplot is also provided for sample classification, outlier detection, and SNP clustering. The software can be downloaded at http://www.stat.sinica.edu.tw/hsinchou/genetics/aloha/ALOHA.htm. New allele frequency and CPA databases have been created. Functions for data query modes (keyword query, general query, and advanced query) and data downloads are provided. The databases are available at http://140.109.72.48/index.htm. We expect that additional practical applications of allele frequency will be found and allele frequency analysis will continue to play a key role in genetic/genomic research.

Appendices

Appendix A. Allele frequency estimation
We summarize the estimation procedures for two types of allele frequencies, individual-level and population-level allele frequency, in this appendix. This summary focuses on the most abundant genetic marker, the SNP, in the human genome. SNP data can be generated from two types of SNP typing experiments, namely an individual genotyping experiment (IGE) and a pooled allelotyping experiment (PAE), which can be carried out using a customized platform (e.g., MALDI-TOF mass spectrometry) or a genome-wide platform (e.g., Affymetrix and Illumina SNP chips).
An IGE collects intensity data (e.g., mass spectrometry peak intensity provided by MALDI-TOF mass spectrometry or hybridization intensity provided by Affymetrix SNP chips) and then uses a SNP calling algorithm to assign genotype calls to SNPs based on intensity data. Both individual-level and population-level allele frequencies can be estimated by using an allele-counting approach based on genotypic data and an intensity-measuring approach based on intensity data.
In contrast to an IGE, which analyzes DNA from each individual, a PAE analyzes mixed DNA from multiple individuals. With PAE it is difficult to extract genotype information for individuals although the same genotyping platform is employed for PAE as IGE. A PAE collects intensity data but cannot determine individual genotypes, and is therefore unable to estimate individual-level allele frequency; nevertheless, a PAE provides a cost-efficient way to estimate population-level allele frequency by using an intensity-measuring approach.
We introduce the estimating procedures as follows. Suppose that there are $n$ independent individuals in total. Consider a SNP with genotype AA, Aa or aa. Let $p_i$ ($i = 1, \ldots, n$) denote individual-level allele frequencies for the $n$ individuals and $p$ denote a population-level allele frequency of allele A. In an IGE, we collect two types of data for each individual: (1) genotypes $G_i$ ($i = 1, \ldots, n$), and (2) pairs of intensities of the two alleles $\{(S_{1i}, S_{2i}),\ i = 1, \ldots, n\}$. In a PAE, we collect a pair of intensities of the two alleles $(S_1, S_2)$ in a DNA pool with a pool size of $n$.
First, we introduce the estimating procedures for an individual-level allele frequency, which is a proportion of a specific allele in a genotype. An individual-level allele frequency can be estimated using an allele-counting approach based on genotypic data and an intensity-measuring approach based on intensity data from an IGE.
The allele-counting approach uses genotype data from an IGE. The number of copies of allele A is counted and then used to calculate the proportion of allele A in a genotype. Under the assumptions of Hardy-Weinberg equilibrium and random sampling, the formulae of the estimator of the frequency of allele A and its expectation and variance can be written as follows (Equation (A1)):
The intensity-measuring approach uses "adjusted" intensity data from an IGE, where intensities are "adjusted" by considering a coefficient of preferential amplification/hybridization (CPA). Based on intensity data from individuals who are heterozygous at the SNP, CPA can be estimated as follows [6,12,17]:
where $\bar{S}_1$ and $\bar{S}_2$ are the sample means of the intensities of these individuals and $n_{heter}$ is the number of individuals heterozygous for the SNP. For each individual, the frequency of allele A is estimated by calculating the ratio of the adjusted intensity pertaining to allele A relative to the total intensity of the two alleles. The estimator and its expectation and variance can be written as follows (Equation (A2)):
where a first-order Taylor expansion is used to calculate the approximate expectation and variance, which involve the coefficient of variation of the CPA and the experimental variation.
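For orientation, the allele-counting estimator and its moments under Hardy-Weinberg equilibrium can be written out as below. This is a sketch derived directly from the genotype probabilities $p^2$, $2p(1-p)$ and $(1-p)^2$; the notation $\hat{p}_i^{G}$ is introduced here for illustration and may differ from that used in Equation (A1).

```latex
\hat{p}_i^{G} =
\begin{cases}
1,   & G_i = AA, \\
1/2, & G_i = Aa, \\
0,   & G_i = aa,
\end{cases}
\qquad
\begin{aligned}
E\bigl(\hat{p}_i^{G}\bigr) &= 1 \cdot p^2 + \tfrac{1}{2} \cdot 2p(1-p) + 0 \cdot (1-p)^2 = p, \\
E\bigl[(\hat{p}_i^{G})^2\bigr] &= p^2 + \tfrac{1}{4} \cdot 2p(1-p) = p^2 + \tfrac{1}{2}\,p(1-p), \\
\operatorname{Var}\bigl(\hat{p}_i^{G}\bigr) &= E\bigl[(\hat{p}_i^{G})^2\bigr] - p^2 = \tfrac{1}{2}\,p(1-p).
\end{aligned}
```

The variance $p(1-p)/2$ reflects that each genotype contributes two independent allele draws under Hardy-Weinberg equilibrium.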
Second, we introduce procedures to estimate a population-level allele frequency, which is the proportion of a specific allele in a study population. The allele frequencies can be estimated based on data from an IGE or a PAE. In an IGE, a population-level allele frequency can be estimated by taking an average over individual-level allele frequencies from genotype data (Equation (A1)) or intensity data (Equation (A2)). The estimators and their expectations and variances can be written as follows (Equations (A3) and (A4)):
In a PAE, relative intensities of allele A in a DNA pool, which is constructed by mixing genomic DNA from $n$ samples, are calculated to estimate a population-level allele frequency. The estimator and its expectation and variance can be written as follows (Equation (A5)):
where $G_m$ is AA, Aa or aa. Then three genotype-specific reference confidence intervals can be written as follows:
The indices are calculated using a spline smoothing or local fit technique, such as a penalized regression [37]. We identify unusual genomic regions where the indices of a patient are higher than the 95%-quantile indices of normal controls in the human genome. The methods are useful for quickly scanning the human genome and accurately determining the starting and ending positions of chromosomal aberration regions for every patient.
A chromosomal aberration plot consisting of three components is used to detect special genomic patterns such as aneuploidy, LOH, and AI (Figures 2 and 3

Appendix C. Allele frequency biplot
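The biplot invented by Gabriel [38] has been applied to explore microarray gene expression data [39,40] and LOH [41]. This paper is the first to apply a biplot to visualize individual-level allele frequency data for hundreds of thousands of SNPs and large samples. Let $\mathbf{A}$ denote the individual-level allele frequency matrix with $R$ rows (SNPs) and $C$ columns (samples). Let $\mathbf{F} = \mathbf{A} - (1/C)\,\mathbf{A}\mathbf{1}\mathbf{1}'$, where $\mathbf{1}$ is a $C \times 1$ column vector with cell values of 1, so that the mean of each SNP over samples is removed. The matrix $\mathbf{F}$ is decomposed into the product of an $R \times K$ SNP-effect matrix (row-effect matrix) $\mathbf{R}$ and the transpose of a $C \times K$ sample-effect matrix (column-effect matrix) $\mathbf{C}$ using a singular value decomposition as outlined below. First, allele frequency matrix $\mathbf{F}$ is decomposed by a singular value decomposition as follows:
$$\mathbf{F} = \mathbf{P}\mathbf{D}\mathbf{Q}' \qquad \text{(C1)}$$
where $\mathbf{P}$ is an $R \times R$ orthogonal matrix (i.e., $\mathbf{P}\mathbf{P}' = \mathbf{I}_R$), and the columns of $\mathbf{P}$ are eigenvectors of $\mathbf{F}\mathbf{F}'$; $\mathbf{Q}$ is a $C \times C$ orthogonal matrix (i.e., $\mathbf{Q}\mathbf{Q}' = \mathbf{I}_C$), and the columns of $\mathbf{Q}$ are eigenvectors of $\mathbf{F}'\mathbf{F}$; $\mathbf{D}$ is an $R \times C$ diagonal matrix having $w$ nonnegative diagonal elements equal to the square roots of the nonnegative eigenvalues of $\mathbf{F}\mathbf{F}'$ and $\mathbf{F}'\mathbf{F}$, where $w = \min(R, C)$. Second, allele frequency matrix $\mathbf{F}$ is further partitioned into row-effect and column-effect matrices as follows:
$$\mathbf{F} = \mathbf{R}\mathbf{C}' \qquad \text{(C2)}$$
where $\mathbf{R} = \mathbf{P}$ and $\mathbf{C} = \mathbf{Q}\mathbf{D}'$ for a SNP-effect biplot, and $\mathbf{R} = \mathbf{P}\mathbf{D}$ and $\mathbf{C} = \mathbf{Q}$ for a sample-effect biplot. Note that $\mathbf{F}\mathbf{F}' = \mathbf{R}\mathbf{R}'$ when $\mathbf{R} = \mathbf{P}\mathbf{D}$, and $\mathbf{F}'\mathbf{F} = \mathbf{C}\mathbf{C}'$ when $\mathbf{C} = \mathbf{Q}\mathbf{D}'$. For the convenience of visualization, a rank-2 approximate biplot is considered. Namely, the first two columns of $\mathbf{R}$ and $\mathbf{C}$, which correspond to the two largest singular values, are selected to approximate the allele frequency matrix $\mathbf{F}$. A higher-rank approximate biplot is also possible by visualizing more than two components pairwise.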
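A rank-2 biplot of this kind can be sketched with NumPy as follows. This is an illustration only: the function name, the `kind` argument and the toy matrix are hypothetical, the thin singular value decomposition returned by `numpy.linalg.svd` is used for convenience, and ALOHA itself is written in R and may differ in its details.

```python
import numpy as np

def allele_frequency_biplot(A, kind="sample"):
    """Rank-2 biplot coordinates from a SNP-by-sample frequency matrix A."""
    A = np.asarray(A, dtype=float)
    # Remove each SNP's mean over samples (row-centering).
    F = A - A.mean(axis=1, keepdims=True)
    # Thin SVD: F = P @ diag(d) @ Qt, singular values in decreasing order.
    P, d, Qt = np.linalg.svd(F, full_matrices=False)
    if kind == "sample":
        # Scaling with R = PD and C = Q.
        snp_xy, sample_xy = P[:, :2] * d[:2], Qt[:2].T
    else:
        # Scaling with R = P and C = QD.
        snp_xy, sample_xy = P[:, :2], Qt[:2].T * d[:2]
    return snp_xy, sample_xy

# Toy data: 6 SNPs (rows) by 4 samples (columns); samples 0-1 and 2-3
# are constructed to come from two groups with different allele frequencies.
A = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5, 0.5],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 0.5, 1.0, 1.0],
     [1.0, 1.0, 0.5, 0.0]]
snp_xy, sample_xy = allele_frequency_biplot(A)
# The sign of the first sample coordinate separates the two groups.
groups = np.sign(sample_xy[:, 0])
```

Plotting `sample_xy` and `snp_xy` on the same axes gives the biplot; samples that lie far from every cluster are candidate outliers, and SNPs pointing toward a cluster are those that characterize it.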