Establishing the baseline level of repetitive element expression in the human cortex
© Tyekucheva et al; licensee BioMed Central Ltd. 2011
Received: 16 December 2010
Accepted: 10 October 2011
Published: 10 October 2011
Skip to main content
© Tyekucheva et al; licensee BioMed Central Ltd. 2011
Received: 16 December 2010
Accepted: 10 October 2011
Published: 10 October 2011
Although nearly half of the human genome is comprised of repetitive sequences, the expression profile of these elements remains largely uncharacterized. Recently developed high throughput sequencing technologies provide us with a powerful new set of tools to study repeat elements. Hence, we performed whole transcriptome sequencing to investigate the expression of repetitive elements in human frontal cortex using postmortem tissue obtained from the Stanley Medical Research Institute.
We found a significant amount of reads from the human frontal cortex originate from repeat elements. We also noticed that Alu elements were expressed at levels higher than expected by random or background transcription. In contrast, L1 elements were expressed at lower than expected amounts.
Repetitive elements are expressed abundantly in the human brain. This expression pattern appears to be element specific and can not be explained by random or background transcription. These results demonstrate that our knowledge about repetitive elements is far from complete. Further characterization is required to determine the mechanism, the control, and the effects of repeat element expression.
Roughly half of the human genome is comprised of repetitive elements; these elements range from the 6Kb LINE1 to micro and minisatellites [1–4]. The biological role of repetitive elements is not known and in general they are believed to be nonfunctional sequences. Some are remnants of ancient germline infections and transposition events [3, 5]. These include the human endogenous retroviruses which at one time presumably existed as infectious exogenous agents . Long and Short Interspersed Nuclear Elements (LINE/SINE) are active retrotransposons which continually insert themselves into the human genome, though the vast majority of the elements have mutated beyond function [3, 7, 8].
Endogenous retroviruses and transposons are largely inactive either through DNA methylation or histone modifications [9, 10], or because they have accumulated mutations over time. However, active transposition by LINEs does occur in humans and occasionally causes disease when the insertion occurs in an exon , splice site, or regulatory region . Insertion of L1 elements in the antisense orientation can also truncate transcripts by creating new polyadenylation sites (a phenomenon known as gene breaking). Repetitive elements alter gene expression by introducing promoter regions near genes, or through many other documented mechanisms [13–15]. It has been suggested that repetitive elements contribute to disease pathology by acting as superantigens or by causing an auto-immune response through molecular mimicry, but these hypotheses are largely speculative [16–18]. Surprisingly, certain endogenous retrovirus proteins have acquired beneficial functions in their hosts. For example, the HERV-W env gene, also known as human syncytin-1, contributes to trophoblast cell fusion during placental development . Additionally, the expression of endogenous retroviral and retrotranposon sequences is hypothesized to play a role in preventing infections by exogenous retroviral agents . However, evidence for this theory is tenuous and the current dogma maintains that repeat elements need to be silenced and selected against because they have adverse effects.
Even if the expression of repeat elements is not harmful, repetitive elements have no known function. Because of this, microarray designs generally exclude repeats, both to conserve probes for known functional elements and to avoid having probes whose genomic position is necessarily unknown. Hence, the extent of repeat element expression remains largely unknown. However, there is evidence that certain repetitive elements are expressed at the RNA level in adult somatic cells and some are even translated into proteins [20, 21].
High throughput sequencing (HTS) allows the entire transcriptome of a sample to be sequenced in an unbiased manner and is a good method for analyzing genome wide repetitive element expression. In a HTS experiment, gene expression is determined by counting the number of reads which are present from a particular transcript and normalizing over the total number of reads observed. Two recent papers used this approach to assess the level of repeat elements expression and found repetitive elements to be more widely expressed than previously thought [22, 23]. One of the papers examined repeat expression in the cerebellum, an area mainly involved in motor movements, and reported higher than expected levels of Alu element expression . However, the authors concluded that expression of other repeat elements was due to "transcriptional background." Indeed, earlier studies suggest that a huge proportion of the human genome is transcribed  and transcripts of repetitive elements could be derived from such "promiscuous transcription." We performed deep transcriptome sequencing on the frontal cortex of ten different human postmortem brains and aligned the reads from these samples to all repetitive regions in the human genome. The frontal cortex is involved in complex cognitive functions and abnormalities in this cortical area are associated with various psychiatric illnesses. Consistent with earlier studies, we found that a large proportion of all aligned reads mapped to repeat elements. Unlike previous studies which sequenced pooled samples, we sequenced our samples independently and found that repeat expression patterns are stable between individuals. We also found that Alu element transcripts are expressed at disproportionately high levels whereas truncated LINE elements are expressed at levels much lower than expected. Hence we conclude that the expression of repeat elements is not a side effect of promiscuous, random transcription.
In order to determine the expression level of repetitive elements in cortical tissue, mRNA was extracted from 10 post mortem frontal cortex samples, converted to cDNA and paired end HTS libraries were constructed. Initially, four lanes of paired end sequencing was performed on three samples (RUN1 and RUN2 Additional File 1: Table S1) on an Illumina Genome Analyzer. In order to increase our sample size, we sequenced an additional seven samples (RUN3 Additional File 1: Table S1) on an Illumina HiSeq 2000. We aligned the sequencing data from these runs to a human repeats database that we constructed. The database contains the reference genome (NCBI36/hg18 build) sequence of every annotated repeat region in repBase with a 50 bp flank on each side of the repeat. In this way, we could be reasonably certain that a short read that did not align to our database was not derived from a repetitive sequence. The flanking sequences were included so that a short read that spanned the junction of a repeat element with unique genomic DNA would not be missed. We avoided simply aligning the sequencing reads to the consensus sequences for the repeat classes since transcripts originating from older or slightly divergent members of a class of repetitive elements may be missed using this approach, and many of the consensus sequences contain ambiguity characters. Using this custom database has several advantages over simply aligning the reads to the genome assembly, and then selecting repeats that intersect with the repeats annotations. First, our method allows us to include repeats annotated on the "random" chromosomes. Second, we are able to capture reads that otherwise might align to the part of the assembly that is repeat-like, but was not annotated as a repeat in RepBase. When we map directly onto our repeats database, such reads will align to the closest member of the appropriate repeat family and won't be excluded from the analysis.
Significance Tests for Observed Repeat Element Transcription
Adjusted for multiple comparisons p-value
LINE-1 elements are unique in that when they transpose, the new copy often ends up severely 5' truncated. Only full-length L1 elements are active, so the vast majority of the copies of L1 in the genome are inactive. Upon a careful examination of the 5' end of L1, we observed a clear increase in the number of reads observed from the first few bases of the element, over the simulated data (Additional File 2: figure S1). This indicates that these full-length elements (the only ones capable of retrotransposition) may be expressed at reasonably high levels, though the bulk of the L1 sequences are underexpressed and the 5' end of L1 is the least abundant segment of this element in the genome. Because full length L1 elements carry their own promoter, it is possible that transcripts from these elements are initiated independently from within the L1, not from neighboring genes.
Our results indicate that repetitive elements in the genome are abundantly transcribed in human cortical tissue. A large number of the reads from the sequencing runs aligned to the repeats database we constructed. These results are consistent with the finding of Faulkner et al.  but they are much lower than the estimates of Xu et al  who report 40% of all transcripts are derived from repetitive sequences. We can not compare our results with Xu et al since they used total RNA as their starting material whereas we used poly-A purified RNA to construct our HTS-libraries. In addition, our study examined cortical tissue which has significantly different gene expression and DNA methylation profiles than the cerebellum . Also, the Northern blot analysis we performed independently validates our findings and assures us that our observations are not due to technical issues related to DNA contamination or peculiarities in the alignment software. Finally, the sequencing data in this paper was generated on two different sequencing platforms, using two separate library preparation techniques by two independent laboratories. The analysis of our sequencing runs along with the MAQC data set yield similar results and demonstrate the robustness of our findings.
The Northern blot analysis we performed suggests that HERV-W gag expression is regulated in a tissue specific manner and these results are consistent with the findings of Nellaker et al . Since the different tissues from the Northern blots did not originate from the same sample, we may be detecting inter individual variation for HERV-W gag expression, which in itself would suggest that this expression is regulated. It is possible, though, that tissue-specific epigenetic differences can drive or repress repeat expression, and this phenomenon can be explored in future studies.
The biological process behind expressing repeat elements at such high levels remains unclear, though random transcription does not explain the expression patterns seen and another mechanism must be involved. The majority of these elements do not have open reading frames (due to mutations) and most likely do not encode proteins, though the full length L1 elements are possibly functional. RNA editing of Alu sequences has been described ; the presence of this mechanism suggests that Alu elements are expressed and then suppressed. Other elements have less clear roles; in fact, LINE-1 retrotransposition activity has not been correlated with its expression. Suggesting a possible function for the high repeat expression levels at this point would be purely speculative; here we attempt only to establish a baseline for further analyses.
Despite making up nearly half the human genome, very little is known about the expression patterns of repetitive elements. In this article we report that repetitive elements make up a significant proportion of the cortical transcriptome. In addition to characterizing the tissue-specific expression profile, the inter-individual transcription variation needs to be determined for these elements. Our data indicate inter-individual variation in the expression of repeat element families whereas the total number of repeat elements in the transcriptome of each individual appears to be fairly consistent. Establishing the normal expression state of repeat elements would be particularly useful for determining if alterations to this expression pattern can cause (or indicate) disease. The presence of the RNA editing machinery which appears to target Alu sequences, along with repetitive element expression that is not consistent with a simple "background expression" model suggests that our knowledge of these not-so-silent elements is still incomplete.
In this article, we report the abundant expression of repeat elements in the human cortex. Remarkably, random background expression can not account for the observed lower than expected LINE-1 and higher than expected Alu element transcription. In addition, our data suggests that full length LINE-1 elements may be expressed at higher levels than truncated copies. Further characterization is required to determine the basis and the consequences of repeat element expression.
All postmortem brain tissue samples were kindly donated by the Stanley Medical Research Institute. Frontal cortex tissue was dissected from the medial frontal gyrus and total RNA was purified using the RNeasy kit (QIAGEN, CA). The quality of each sample was verified by running an aliquot on a Bioanalyzer (Agilent Technologies, CA).
Three sequencing runs were performed for this paper. RUN1 and RUN2 were performed at Cold Spring Harbor Laboratory and RUN3 was performed at Johns Hopkins University. For the 3 samples in RUN1 and RUN2 10 μg of TRIzol-extracted total RNA was cleaned up using the RNeasy Mini Kit (QIAGEN, CA) prior to cDNA cloning. To isolate the mRNA fraction from each sample, the total RNA was treated to two consecutive rounds of denaturation at 65°C and binding to Dynabeads Oligo(dT)25 (Invitrogen, CA). The purified mRNA samples were then fragmented using Fragmentation Buffer (Ambion, CA) and recovered by ethanol precipitation with glycogen as a carrier. For first strand cDNA synthesis, the mRNA samples were incubated with Random Hexamer Primers (Invitrogen, CA), First Strand Buffer (Invitrogen, CA), RNaseOUT (Invitrogen, CA), DTT, dNTP mix, and SuperScript II (Invitrogen, CA) in a thermal cycler using the following program: 25°C for 10 min, 42°C for 50 min, 70°C for 15 min, and 4°C hold. For second strand cDNA synthesis, the first strand samples were mixed with second strand buffer (50 mM Tris-HCl, pH 7.8; 5 mM MgCl2; and 1 mM DTT), dNTP mix, RNase H (Invitrogen, CA), and DNA Polymerase I (Invitrogen, CA), followed with incubation at 16°C for 2.5 h. The cDNA samples were purified using the QIAquick PCR Purification Kit (QIAGEN, CA). Following purification, the samples were converted into cDNA libraries using a series of steps derived from the Paired-End Sample Prep Kit (Illumina, CA). The modifications to the standard library preparation procedure are as follows: (1) after the addition of 3'-A overhangs, the samples were purified using the MinElute PCR Purification Kit (QIAGEN) and the entire eluates were used for the subsequent Illumina adapter ligation step, (2) the selected library fragment size was 200 bp, (3) the gel extraction step was carried out using the MinElute Gel Extraction Kit (QIAGEN, CA) and 20 μL of Buffer EB for elution, and (4) the entire gel extraction eluates were used for PCR enrichment of the libraries using a sixteen-cycle program. The completed cDNA libraries were individually sequenced on an Illumina Genome Analyzer producing 100 base reads from each end. For the 7 samples in RUN3, 5 μg of RNA was treated with Turbo DNase (Ambion, CA) and paired end sequencing libraries were generated using the TruSeq RNA Sample prep kit following the manufacturer's recommended protocol. These libraries were sequenced on a HiSeq 2000 producing 100 base reads from each end.
Premade blots containing 1 μg of mRNA derived from adult human tissues were purchased from Biochain (Hayward, CA). RNA probes were generated using the Riboprobe kit (Promega, WI) and P-32 labeled cytosines. The hybridization and washes on the blot were carried out using the Northern Max kit (Ambion, CA).
We treated beginnings and ends of the paired-end reads as technical replicates, and aligned them separately. As we accept any alignment of a read to our database of repeats, splitting the paired ends does not compromise our ability to correctly assign reads to repeat families. All reads for the frontal cortex samples were trimmed to 76 base length. We aligned the reads to reference subgenomes (RepBase, intronic, and intergenic regions) using Bowtie software http://bowtie-bio.sourceforge.net/index.shtml using parameters that allow zero mismatches in the first 20 bases of the read, and retain the best alignment for the read. The same parameters were used to align both actual and simulated reads. For alignments to introns and intergenic regions we only accepted reads that aligned exactly once in the entire genome. Genomic coordinates used to build reference subgenomes were determined using UCSC Genome Browser  and sequences from hg18 build were fetched using Galaxy (http://g2.bx.psu.edu/).
In simulations we aimed to reproduce the scenario of random (uniform) transcription; while our simulated distributions closely mirror the genomic profiles of the sequences being evaluated, the simulations are critical to establishing statistical significance, as they control for any biases introduced by the alignment process. We extracted genomic sequences for RepBase annotated repeats, generated random start sites for the reads and extracted 76 base pairs for our simulated reads. Sampling from CCDS was done similarly, using consensus coding sequences from NCBI. We generated 100 datasets, each comprising the appropriate number of reads (see Results) and read length.
For statistical testing we pooled the data from the technical replicates of the same biological sample (we considered beginnings and ends of the paired ends to be technical replicates as well). We then compared read counts from these 10 samples mapping to various repeat families with read counts from our simulated data. We employed a statistical test proposed by Anders and Huber . This test for differential expression in RNA-Seq data models the read counts mapping to a given repeat family using a negative binomial distribution with the parameters estimated from the observed data. We used R/Bioconductor package "DESeq" implementation of the test. This R/Bioconductor package performs multiple testing correction using Benjamini-Hochberg method .
The authors would like to thank the Stanley Medical Research Institute and Dr. Fuller Torrey and Maree Webster for kindly providing the postmortem cortical tissue used in this study. The authors are grateful to The Stanley Medical Research Institute for funding this research at Johns Hopkins Stanley Division of Developmental Neurovirology and through the Stanley Institute for Cognitive Genomics at CSHL.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.