AluScan: a method for genome-wide scanning of sequence and structure variations in the human genome
- Lingling Mei†1,
- Xiaofan Ding†1,
- Shui-Ying Tsang1,
- Frank W Pun1,
- Siu-Kin Ng1,
- Jianfeng Yang1,
- Cunyou Zhao1,
- Dezhi Li2,
- Weiqing Wan2, 3,
- Chi-Hung Yu4, 5,
- Tze-Ching Tan4, 5,
- Wai-Sang Poon5, 6,
- Gilberto Ka-Kit Leung5, 7,
- Ho-Keung Ng5, 8,
- Liwei Zhang2, 3 and
- Hong Xue1, 3, 5
© Mei et al; licensee BioMed Central Ltd. 2011
Received: 22 August 2011
Accepted: 17 November 2011
Published: 17 November 2011
To complement next-generation sequencing technologies, there is a pressing need for efficient pre-sequencing capture methods with reduced costs and DNA requirement. The Alu family of short interspersed nucleotide elements is the most abundant type of transposable elements in the human genome and a recognized source of genome instability. With over one million Alu elements distributed throughout the genome, they are well positioned to facilitate genome-wide sequence amplification and capture of regions likely to harbor genetic variation hotspots of biological relevance.
Here we report on the use of inter-Alu PCR with an enhanced range of amplicons, in conjunction with next-generation sequencing, to generate an Alu-anchored scan, or 'AluScan', of DNA sequences between Alu transposons. Alu consensus sequence-based 'H-type' PCR primers that elongate outward from the head of an Alu element are combined with 'T-type' primers that elongate from the poly-A-containing tail to achieve a huge amplicon range. To illustrate the method, glioma DNA was compared with white blood cell control DNA of the same patient by means of AluScan. The more than 10 Mb of sequences obtained, derived from more than 8,000 genes spread over all the chromosomes, revealed a highly reproducible capture of genomic sequences enriched in genic sequences and cancer candidate gene regions. Requiring only sub-micrograms of sample DNA, the power of AluScan as a discovery tool for genetic variations was demonstrated by the identification of 357 instances of loss of heterozygosity, 341 somatic indels, 274 somatic SNVs, and seven potential somatic SNV hotspots between control and glioma DNA.
AluScan, implemented with just a small number of H-type and T-type inter-Alu PCR primers, provides an effective capture of a diversity of genome-wide sequences for analysis. The method, by enabling an examination of gene-enriched regions containing exons, introns, and intergenic sequences with modest capture and sequencing costs, computation workload and DNA sample requirement, is particularly well suited for accelerating the discovery of somatic mutations, as well as analysis of disease-predisposing germline polymorphisms, by making possible the comparative genome-wide scanning of DNA sequences from large human cohorts.
Next-generation, massively-parallel sequencing technologies have transformed the landscape of genetics through their ability to produce giga-bases of sequence information in a single run. However, the sequencing cost, computation workload and amount of sample DNA required are still too high for large scale population analysis by means of whole-genome sequencing. There is clearly a need for pre-sequencing capture of subsets of the genome in order to reduce these requirements. Although the whole exome represents a valuable subset, its exclusion of introns, and the high cost and high DNA requirement for its analysis, remain major limitations. Other sequence subsets therefore clearly need to be explored.
Alu-transposons are a family of primate-specific short interspersed nucleotide elements (SINE) of ~300 bp derived from 7SL RNA. Although Alu elements were once considered 'junk DNA', their biological importance, in particular their influence on genome instability, is being increasingly recognized [2, 3]. They are abundant in gene-rich regions [4, 5], exert a major impact on genomic architecture, and increase local recombination rates. Previously we have found enhanced SNP frequencies in the vicinity of Alu-elements, more so among the youngest AluY elements than the intermediate-age AluS and the oldest AluJ. AluYs also display a higher rate of methylation, consistent with a stronger silencing pressure on these elements. Genotypic variations surrounding a human lineage-specific AluY insertion in the GABRB2 gene encoding GABAA receptor β2 subunit have been found by us to constitute a joint focal point for positive evolutionary selection, hotspot recombinations, as well as association with schizophrenia and bipolar disorder [12, 13]. Neighborhoods of Alu-transposons are therefore a highly significant sequence subset of the human genome in terms of evolutionary development and pathogenesis.
Control and glioma sequence outputs from AluScan
- Total initial reads (bp)
- Initial reads (bp) mapped to GRCh37.p2
- Genomic regions mapped (bp) with coverage ≥ 1
- Reads mapped (bp) with coverage ≥ 10
- Genomic regions mapped with coverage ≥ 10:
- Total regions (bp)
- Average depth (×)
- Inter-Alu sequences mapped (bp)
- Genic sequences mapped (bp)
- Genic sequences mapped as % of total mapped regions
- Excess in genic sequences^a
- Number of genes with mapped region
- Number of cancer candidate genes^b with mapped region
- Cancer candidate genes^b with mapped region as % of genes with mapped region
- Excess in cancer candidate genes^c
Seven 5-Mb intervals in the glioma sequences displayed enhanced numbers of somatic SNVs, where the number of somatic SNVs > 4, indicating the potential presence of somatic SNV hotspots (Figure 6). Of these seven potential SNV hotspots, those in chromosomal regions 12q13, 17q21, 18p11, 19p13 and 19q13 harboured altogether 16 SNV-containing genes including RAB5C of the RAS oncogene family in 17q21 (Additional File 4). None of these 16 genes were included in OMIM as a known glioma-associated gene. These findings illustrated the usefulness of AluScan as a discovery tool.
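The hotspot criterion above (a 5-Mb interval holding more than four somatic SNVs) can be sketched as a simple binning step. This is a minimal illustration, not the authors' code: it assumes fixed, non-overlapping bins aligned to the start of each chromosome, and the function name `snv_hotspots` and the example positions are hypothetical.

```python
from collections import Counter

BIN_SIZE = 5_000_000  # 5-Mb intervals, as stated in the text
MIN_SNVS = 5          # "number of somatic SNVs > 4"


def snv_hotspots(snvs, bin_size=BIN_SIZE, min_snvs=MIN_SNVS):
    """Return {(chrom, bin_start): count} for bins with at least min_snvs SNVs.

    snvs is an iterable of (chromosome, 1-based position) pairs.
    bin_start is the 0-based start coordinate of the bin.
    """
    counts = Counter(
        (chrom, (pos - 1) // bin_size * bin_size) for chrom, pos in snvs
    )
    return {key: n for key, n in counts.items() if n >= min_snvs}


# Hypothetical positions for illustration only (not data from the study).
example = [("chr17", 40_000_001 + i * 1000) for i in range(5)]
example.append(("chr1", 1_234_567))
print(snv_hotspots(example))
```

Sliding or differently anchored windows would shift which intervals pass the threshold, so the bin alignment chosen here is one assumption among several possible ones.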
Using only 90 ng sample DNA in each instance, the AluScans performed in the present study with one H-type and two T-type primers generated reads that covered a total of ~58-64 Mb, or ~1.9-2.1% of genomic sequences. This total was comparable in order of magnitude to the genomic sequences in principle capturable by the set of three H and T-type consensual Alu-based primers employed, which were estimated to be ~14 Mb for exact primer-template matches, or ~106 Mb allowing for one mismatched base-pair per primer (Figure 1B), but still far below the total of 1.10 Gb inter-Alu regions of ≤ 6 kb in length in the human genome (Figure 1A). Thus there could be ample room for widening the scope of AluScan-capturable sequences through the use of diverse combinations of H- and T-type primers. Primers specific for other transposable elements such as LINEs, LTRs, as well as other types of more specialized primers could also be utilized to tailor the AluScan capture to a given investigational goal. Moreover, by treating target DNA with bisulfite to modify unmethylated C-residues prior to AluScan, epigenomic changes in normal and diseased cells may also be monitored.
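The coverage percentages quoted above follow from simple ratios. The sketch below reproduces them under the assumption of an approximate genome size of 3.1 Gb (this value is not stated in the text); the 1.10 Gb figure for inter-Alu regions of ≤ 6 kb is taken from the paragraph above.

```python
GENOME_BP = 3.1e9      # assumed approximate human genome size (not from the text)
INTER_ALU_BP = 1.10e9  # inter-Alu regions of <= 6 kb, from the text


def percent(part_bp, whole_bp):
    """Express part_bp as a percentage of whole_bp."""
    return 100.0 * part_bp / whole_bp


for covered_bp in (58e6, 64e6):  # range of total coverage reported
    print(
        f"{covered_bp / 1e6:.0f} Mb: "
        f"{percent(covered_bp, GENOME_BP):.1f}% of the genome, "
        f"{percent(covered_bp, INTER_ALU_BP):.1f}% of inter-Alu regions <= 6 kb"
    )
```

On these assumptions the captured sequences amount to only a few percent of the short inter-Alu regions, which is the basis for the remark that there is ample room to widen the capture.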
By combining the twin advantages of multitudinous amplification of inter-Alu sequences through the joint usage of H-type and T-type primers, and massively parallel next-generation sequencing, AluScan thus provides a new method for genome-wide investigation in addition to whole genome sequencing (WGS) and whole exome sequencing (WES). WGS is the standard in comprehensiveness, but incurs high operation cost, large computation workload and multi-microgram DNA requirement. WES provides integral insight into the entire exome, but leaves the intronic regions uncharacterized, besides incurring high capture cost and multi-microgram DNA requirement. AluScan permits an examination of gene-enriched segments of exons, introns and intergenic sequences requiring comparatively modest capture and sequencing costs, lighter computation workload and only sub-microgram DNA samples. These three methods complement one another, together making possible a comprehensive analysis of sequence and structure variations of the human genome.
AluScan implemented with just a small number of PCR primers based on consensus Alu sequences provides a multiplex method for genome-wide sequence analysis. Through the inclusion of H and T type primers, the approach employs the abundance and wide distribution of Alu elements in the human genome as the basis for the effective capture of a huge number of DNA sequences in the vicinity of Alu elements. As demonstrated by the strong correlation between the captured white blood cell and glioma sequences, the same set of H and T-type primers has led to an extensively reproducible subset of genomic sequences in the two separate AluScans. As well, at least for this set of H and T-type primers, the captured sequences were enriched in genic and cancer-related DNA sequences.
The results in Figure 6 illustrate the utility of AluScan as a discovery tool. Comparison of the paired white blood cell-glioma DNAs of a single patient has led to the uncovering of 357 LOHs and 274 somatic SNVs, a majority of which likely arose in the glioma, and seven potential SNV hotspots located on six different chromosomes. Importantly, the modest technical cost and DNA sample size required for AluScan will render practicable a follow-up with similarly paired AluScans for tens to hundreds of glioma patients in order to distinguish the somatic and germline driver mutations fundamental to the development of the disease from passenger mutations. A major application of AluScan will thus reside in its facilitation of large cohort studies for clinical and biological investigations of the human genome.
Paired blood and cancer samples were obtained with consent and institutional ethics approval from a male Chinese Han patient with anaplastic oligodendroglioma at Beijing Tiantan Hospital for the preparation of control DNA by phenol-chloroform extraction and cancer genomic DNA using the AllPrep kit (Qiagen).
Inter-Alu PCR and next-generation sequencing
Fifteen parallel 25-μl PCR reaction mixtures were prepared, each containing 2 μl Bioline 10× NH4 buffer (160 mM ammonium sulfate, 670 mM Tris-HCl, pH 8.8, 0.1% stabilizer), 3 mM MgCl2, 0.15 mM dNTP mix, 0.3 μM AluY278T18 primer, 0.18 μM AluY66H21 primer, 0.06 μM R12A/267 primer, 1 unit Bioline Taq polymerase, and 6 ng control or glioma DNA. PCR amplification for AluScan comprised DNA denaturation at 95°C for 5 min, followed by 35 cycles each of 30 s at 95°C, 30 s at 54°C, and 5 min at 71°C, and finally another 5 min at 71°C. Amplicons were purified by ethanol precipitation, and ≥ 3 μg purified products per sample were employed for Illumina GAII library construction and sequencing at Beijing Genomics Institute (Shenzhen, China). AluY278T18 (5'-GAGCGAGACTCCGTCTCA-3') and AluY66H21 (5'-TGGTCTCGATCTCCTGACCTC-3') were AluY consensus primers [22, 23]. In the name AluY278T18, 'AluY' represents the subfamily, '278' the first position on the AluY consensus sequence paired with the primer, 'T' a 'Tail-type' primer (vs. 'H' for 'H-type'), and '18' the length of the primer. R12A/267 (T-type) was an Alu consensus primer employed earlier for inter-Alu PCR at an annealing temperature of 56°C.
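The primer naming convention described above (subfamily, first paired consensus position, H/T type, length) can be checked mechanically. The following is a small sketch; the regular expression and the names `PrimerName` and `parse_primer_name` are illustrative assumptions, and only the two primer sequences are taken from the text.

```python
import re
from typing import NamedTuple


class PrimerName(NamedTuple):
    subfamily: str    # e.g. "AluY"
    position: int     # first consensus position paired with the primer
    primer_type: str  # "H" (head-type) or "T" (tail-type)
    length: int       # primer length in nucleotides


# Assumed pattern for names such as AluY278T18; other naming schemes
# (e.g. R12A/267) do not follow it and are rejected.
_PATTERN = re.compile(r"^(Alu[A-Za-z]+)(\d+)([HT])(\d+)$")


def parse_primer_name(name):
    """Split a primer name into its four components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not in subfamily-position-type-length form: {name!r}")
    subfamily, position, primer_type, length = match.groups()
    return PrimerName(subfamily, int(position), primer_type, int(length))


# Sequences as given in the text.
PRIMERS = {
    "AluY278T18": "GAGCGAGACTCCGTCTCA",
    "AluY66H21": "TGGTCTCGATCTCCTGACCTC",
}

for name, sequence in PRIMERS.items():
    parsed = parse_primer_name(name)
    # The length encoded in the name should match the sequence length.
    assert parsed.length == len(sequence)
    print(parsed)
```

Note also that fifteen reactions at 6 ng each account for the 90 ng of sample DNA per AluScan mentioned in the discussion.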
Agarose gel electrophoresis
PCR was performed essentially as described in the preceding section, except that one PCR tube of 20 μl containing 100 ng control DNA was employed. The annealing temperatures were chosen to maximize in each instance the yield of amplicons: 60°C for lane A in Figure 2, 58°C for B-D, 56°C for H and L, 64°C for N, and 54°C for the other lanes. Primer concentration was 0.30 μM for the single-primer lanes A-D; 0.15 μM per primer for the two-primer lanes E-J; 0.10 μM per primer for the three-primer lanes K, L, R, T-V; and 0.30 μM per primer for the triple-dosed lane S. The concentrations of primers AluY278T18, AluY66H21 and R12A/267 in lane Q were 0.375 μM, 0.225 μM and 0.075 μM respectively; lane P was the same as lane Q with omission of R12A/267; and lane N was the same as lane P with further omission of AluY66H21.
Read mapping and variant analysis
Sequence reads were mapped to the GRCh37.p2 reference human genome using BWA (bwa-short algorithm version 0.5.9rc1) with default settings. Initial mapping results were transferred into indexed and sorted BAM format using SAMtools version 0.1.12a, and further recalibrated and locally realigned using the Genome Analysis Toolkit (GATK version 1.0.4905) software. Regions with read depths of < 10× were not analyzed further.
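The depth cut-off in the preceding paragraph can be pictured as collapsing per-base depths into intervals that reach at least 10×. The sketch below assumes depths are already available as a plain list per chromosome (for example, parsed from a depth report); the function name `covered_regions` and the example values are hypothetical and do not come from the study.

```python
def covered_regions(depths, min_depth=10):
    """Yield half-open, 0-based (start, end) intervals with depth >= min_depth.

    depths is a sequence of per-base read depths along one chromosome.
    """
    start = None
    for i, depth in enumerate(depths):
        if depth >= min_depth:
            if start is None:
                start = i          # open a new region
        elif start is not None:
            yield (start, i)       # close the current region
            start = None
    if start is not None:
        yield (start, len(depths))  # region runs to the end of the sequence


# Hypothetical depths for illustration only.
example_depths = [0, 3, 12, 15, 10, 9, 0, 22, 30]
print(list(covered_regions(example_depths)))
```

Only positions inside the returned intervals would be carried forward to variant calling under this reading of the filter.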
The UnifiedGenotyper module in GATK was used to produce the primary SNV calls, which were filtered using the parameter '-stand_call_conf 50.0' and the Variant Filtration module, ensuring a coverage depth > 10×, mapping quality > 25.0 and strand bias < 0. SNVs in the vicinity of indels were removed by means of the IndelGenotyperV2 module. Further filtration was achieved using the criterion that homozygous reference loci have a non-reference read frequency of < 10%, heterozygous SNVs have a non-reference read frequency of ≥ 10% and < 85%, and homozygous non-reference SNVs have a non-reference read frequency of ≥ 85%. Small indels were called using mpileup with '-ugf' and bcftools with '-bvcg' in SAMtools; and the calls were filtered using the script vcfutils.pl in SAMtools with default settings. Structural variants were identified initially using BreakDancer version 1.1 and refined using Pindel version 0.20. Somatic SNVs were defined as heterozygous loci present in the tumor genome that corresponded to homozygous loci in the control genome, and LOH SNVs were defined as heterozygous loci present in the control genome that corresponded to homozygous loci in the tumor genome. Novel somatic SNVs were obtained by removing all LOHs and those SNVs already reported in dbSNP132. LOHs were identified by comparison between control and glioma reads using ExomeCNV version 1.23.0.
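The read-frequency thresholds and the somatic SNV/LOH definitions above can be expressed compactly. This is a minimal sketch of those rules only; it does not reproduce the GATK, SAMtools or ExomeCNV steps, and the function names and labels (`genotype`, `compare_locus`, "other") are illustrative assumptions.

```python
def genotype(nonref_fraction):
    """Classify a locus by its non-reference read frequency (0.0 to 1.0)."""
    if nonref_fraction < 0.10:
        return "hom_ref"   # homozygous reference: < 10%
    if nonref_fraction < 0.85:
        return "het"       # heterozygous: >= 10% and < 85%
    return "hom_alt"       # homozygous non-reference: >= 85%


def compare_locus(control_fraction, tumor_fraction):
    """Label a locus from paired control and tumor non-reference frequencies."""
    control = genotype(control_fraction)
    tumor = genotype(tumor_fraction)
    if control != "het" and tumor == "het":
        return "somatic_SNV"  # homozygous in control, heterozygous in tumor
    if control == "het" and tumor != "het":
        return "LOH"          # heterozygous in control, homozygous in tumor
    return "other"            # any other combination


# Hypothetical frequencies for illustration only.
print(compare_locus(0.02, 0.40))  # control hom_ref, tumor het
print(compare_locus(0.45, 0.95))  # control het, tumor hom_alt
```

In practice such a comparison would apply only to loci that pass the depth, mapping-quality and strand-bias filters in both samples.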
We are grateful to the Innovation and Technology Fund of Hong Kong SAR (Grant ITS/085/10) and Hong Kong University of Science and Technology (Grant VPRDO09/10.SC08 and Special Research Fund Initiative SRFI11SC06) for financial support.
- Ullu E, Tschudi C: Alu sequences are processed 7SL RNA genes. Nature. 1984, 312: 171-172. 10.1038/312171a0.
- Konkel MK, Batzer MA: A mobile threat to genome stability: The impact of non-LTR retrotransposons upon the human genome. Semin Cancer Biol. 2010, 20: 211-221. 10.1016/j.semcancer.2010.03.001.
- Zhang Y, Romanish MT, Mager DL: Distributions of transposable elements reveal hazardous zones in mammalian introns. PLoS Comput Biol. 2011, 7: e1002046. 10.1371/journal.pcbi.1002046.
- Batzer MA, Deininger PL: Alu repeats and human genomic diversity. Nat Rev Genet. 2002, 3: 370-379. 10.1038/nrg798.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Deininger PL, Moran JV, Batzer MA, Kazazian HH: Mobile elements and mammalian genome evolution. Curr Opin Genet Dev. 2003, 13: 651-658. 10.1016/j.gde.2003.10.013.
- Witherspoon DJ, Watkins WS, Zhang Y, Xing J, Tolpinrud WL, Hedges DJ, Batzer MA, Jorde LB: Alu repeats increase local recombination rates. BMC Genomics. 2009, 10: 530. 10.1186/1471-2164-10-530.
- Ng SK, Xue H: Alu-associated enhancement of single nucleotide polymorphisms in the human genome. Gene. 2006, 368: 110-116.
- Rodriguez J, Vives L, Jorda M, Morales C, Munoz M, Vendrell E, Peinado MA: Genome-wide tracking of unmethylated DNA Alu repeats in normal and cancer cells. Nucleic Acids Res. 2008, 36: 770-784.
- Lo WS, Xu Z, Yu Z, Pun FW, Ng SK, Chen J, Tong KL, Zhao C, Xu X, Tsang SY, Harano M, Stober G, Nimgaonkar VL, Xue H: Positive selection within the Schizophrenia-associated GABAA receptor β2 gene. PLoS One. 2007, 2: e462. 10.1371/journal.pone.0000462.
- Ng SK, Lo WS, Pun FW, Zhao C, Yu Z, Chen J, Tong KL, Xu Z, Tsang SY, Yang Q, Yu W, Nimgaonkar V, Stober G, Harano M, Xue H: A recombination hotspot in a schizophrenia-associated region of GABRB2. PLoS One. 2010, 5: e9547. 10.1371/journal.pone.0009547.
- Lo WS, Lau CF, Xuan Z, Chan CF, Feng GY, He L, Cao ZC, Liu H, Luan QM, Xue H: Association of SNPs and haplotypes in GABAA receptor β2 gene with schizophrenia. Mol Psychiatry. 2004, 9: 603-608. 10.1038/sj.mp.4001461.
- Zhao C, Xu Z, Wang F, Chen J, Ng SK, Wong PW, Yu Z, Pun FW, Ren L, Lo WS, Tsang SY, Xue H: Alternative-splicing in the exon-10 region of GABAA receptor β2 subunit gene: relationships between novel isoforms and psychotic disorders. PLoS One. 2009, 4: e6977. 10.1371/journal.pone.0006977.
- Nelson DL, Ledbetter SA, Corbo L, Victoria MF, Ramirez-Solis R, Webster TD, Ledbetter DH, Caskey CT: Alu polymerase chain reaction: a method for rapid isolation of human-specific sequences from complex DNA sources. Proc Natl Acad Sci USA. 1989, 86: 6686-6690. 10.1073/pnas.86.17.6686.
- Zietkiewicz E, Labuda M, Sinnett D, Glorieux FH, Labuda D: Linkage mapping by simultaneous screening of multiple polymorphic loci using Alu oligonucleotide-directed PCR. Proc Natl Acad Sci USA. 1992, 89: 8448-8451. 10.1073/pnas.89.18.8448.
- Kass DH, Batzer MA: Inter-Alu polymerase chain reaction: advancements and applications. Anal Biochem. 1995, 228: 185-193. 10.1006/abio.1995.1338.
- Krajinovic M, Richer C, Labuda D, Sinnett D: Detection of a mutator phenotype in cancer cells by inter-Alu polymerase chain reaction. Cancer Res. 1996, 56: 2733-2737.
- Srivastava T, Seth A, Datta K, Chosdol K, Chattopadhyay P, Sinha S: Inter-alu PCR detects high frequency of genetic alterations in glioma cells exposed to sub-lethal cisplatin. Int J Cancer. 2005, 117: 683-689. 10.1002/ijc.21057.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- Kim DW, Nam SH, Kim RN, Choi SH, Park HS: Whole human exome capture for high-throughput sequencing. Genome. 2010, 53: 568-574. 10.1139/G10-025.
- Durand KS, Guillaudeau A, Weinbreck N, DeArmas R, Robert S, Chaunavel A, Pommepuy I, Bourthoumieu S, Caire F, Sturtzand FG, Labrousse FJ: 1p19q LOH patterns and expression of p53 and Olig2 in gliomas: relation with histological types and prognosis. Mod Pathol. 2010, 23: 619-628. 10.1038/modpathol.2009.185.
- Park ES, Huh JW, Kim TH, Kwak KD, Kim W, Kim HS: Analysis of newly identified low copy AluYj subfamily. Genes Genet Syst. 2005, 80: 415-422. 10.1266/ggs.80.415.
- Price AL, Eskin E, Pevzner PA: Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 2004, 14: 2245-2252. 10.1101/gr.2693004.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20: 1297-1303. 10.1101/gr.107524.110.
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677-681. 10.1038/nmeth.1363.
- Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009, 25: 2865-2871. 10.1093/bioinformatics/btp394.
- Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics. 2011, 27: 2648-2654. 10.1093/bioinformatics/btr462.
- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res. 2009, 19: 1639-1645. 10.1101/gr.092759.109.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.