A transcriptional sketch of a primary human breast cancer by 454 deep sequencing

Background The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts. Results We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in additional eight primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas. 
Conclusion Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling.


Background
LTR retrotransposons transpose through reverse transcription of an RNA intermediate and are ubiquitous components of all eukaryotic genomes thus far examined [1]. Plant genomes, in particular, have been found to be comprised of a remarkably high number of LTR retrotransposons [2][3][4]. For example, more than half of the maize [5] and over 90% of the wheat [6] genomes are comprised of LTR retrotransposons. There is a significant body of direct and indirect evidence that LTR retrotransposons have contributed to gene and genome evolution in both animals and plants [4][7][8][9][10][11][12][13].
Rice (Oryza sativa L.) is a staple food for over half of the world's population. Unlike most other cereal grasses, rice has a relatively small genome estimated at 430 Mb [14,15]. Comparative genetic maps within the grass family indicate extensive regions of conserved gene content and order [16]. The development of many important tools for genetic analysis, including excellent genetic and physical maps [17,18], efficient genetic transformation techniques [19], an ever increasing dataset of expressed sequence tags (ESTs) [20], and a large collection of diverse germplasm [21], has established rice as an ideal model for the study of mechanisms underlying plant evolution. The recent release of draft genome sequences of two rice subspecies, indica [22] and japonica [23], and ongoing efforts by the International Rice Genome Sequencing Project (IRGSP) to make a complete japonica rice genome sequence publicly available [15], promise to greatly facilitate the study of rice genome evolution.
Most transposable elements (TEs) in rice were initially discovered by chance or through limited assays using conserved regions of retrotransposons [24][25][26]. For example, based on the screening of genomic libraries with a conserved retrotransposon probe, Hirochika and his colleagues [24] estimated that the rice genome contains a total of ~1000 retrotransposons falling into 32 families. By hybridizing a rice BAC library with a variety of RT probes, however, Wang et al. [27] estimated that the entire haploid genome contains only about 100 copia-like elements. Computer-based sequence similarity searches of the nucleic acid databases are considered to be a more accurate method to identify retrotransposon families and estimate TE content and distribution in the rice genome [28][29][30][31]. Using a new data-mining program (LTR retrotransposon structure program), we previously mined the GenBank rice database (GBRD), as well as a more extensive Monsanto rice data set (MRD) (259 Mb), for LTR retrotransposons [32]. Including numerous LTR retrotransposon families that had not been previously reported, our comprehensive survey indicated that there are at least 59 distinct O. sativa LTR retrotransposon families comprising ≈17% of the rice genome. Although the majority of rice retrotransposon families have been identified and their phylogenetic relationships have been inferred based on the sequence similarity of the RT domains, the evolutionary dynamics, substructure, and relative integration time or age of rice retrotransposon elements remain poorly understood.
Our laboratory has long been interested in the evolution of LTR retrotransposons and their contribution to genome evolution. The current availability of a considerable part of the rice genome sequence in the NCBI database provides an excellent and timely opportunity to analyze the evolutionary history of rice LTR-retrotransposons. The present study was undertaken (i) to assess retrotransposon diversity and the timing of insertion events by identifying all distinguishable rice LTR-retrotransposon sequences in the published sequence database, and (ii) to further infer possible retrotransposon evolution by establishing the substructure of all characterized families from the sequence divergence among the more rapidly evolving LTRs. In order to gain insight into the chromosomal distribution of LTR-retrotransposons in the rice genome, we also selected chromosome 10 to conduct a detailed examination of the intra-chromosomal distribution of LTR-retrotransposons. Finally, we have made a preliminary attempt to assess the potential contribution of LTR-retrotransposons to the evolution of gene structure and function in rice by identifying elements located in or near putative genes.

Characterization of O. sativa LTR Retrotransposons
LTR retrotransposons displaying ≥ 90% reverse transcriptase (RT) pairwise identity at the amino acid level were, by earlier convention, assigned to the same family [33]. Using this criterion, McCarthy et al. [32] previously identified and characterized 55 families of LTR retrotransposons in the rice genome. Four additional groups of nonautonomous LTR retrotransposons identified in this previous study did not display RT sequence homology but were, nevertheless, designated as families based on their distinct structures [32]. Only the 38 retrotransposon families for which both LTRs have been identified [32] were included in this study.
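The family criterion above can be illustrated with a short sketch. It assumes the RT amino acid sequences have already been aligned to equal length, and it groups elements by single linkage (two elements share a family if they are connected by a chain of pairs at or above the threshold). The linkage rule, the function names and the gap handling are illustrative assumptions; the source states only the ≥ 90% pairwise identity convention.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def group_into_families(aligned_rt: dict, threshold: float = 90.0) -> list:
    """Single-linkage grouping of elements by RT identity (union-find)."""
    parent = {name: name for name in aligned_rt}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(aligned_rt, 2):
        if percent_identity(aligned_rt[a], aligned_rt[b]) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for name in aligned_rt:
        families.setdefault(find(name), []).append(name)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])
```

With three toy sequences, `group_into_families({"rt_a": "MKVLAAGITR", "rt_b": "MKVLAAGITK", "rt_c": "MKVLSSGWTK"})` places `rt_a` and `rt_b` (90% identical) in one family and `rt_c` in another.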
In this study, we have extended the earlier survey of full-length LTR retrotransposons in the rice genome to include all identifiable fragments of LTR retrotransposon sequences. In accord with previously established criteria [34], we classified rice LTR retrotransposon sequences into three major groups: 1) full-length elements: a) autonomous full-length elements contain all of the characteristic features of LTR retrotransposons including putative gag, pol and, in some cases, env genes flanked by LTRs; b) nonautonomous full-length elements are a recently identified [32,39] sub-class of full-length elements that have two LTRs flanking a series of repeating motifs of various lengths. Although numerous, relatively small and, in some cases, recently transposed, these non-autonomous elements encode none of the typical LTR retrotransposon ORFs and must acquire essential reverse transcription and integration functions in trans; 2) solo LTRs are solitary LTR elements believed to be the products of recombination events between the flanking LTRs of full-length elements [1]; 3) fragmented elements are defined as partially deleted or truncated LTR retrotransposon sequences. This third category is a "catch all" grouping that includes all LTR retrotransposon sequences that are neither full-length elements nor solo LTRs. For information purposes, we report in this paper both the total number of fragmented elements and the number of these that are fragmented solo LTRs.
We identified a total of 1219 LTR retrotransposon sequences in the portion of the rice genome surveyed in this study distributed over 38 families (Tables 1, see Additional file 1; Table 1). Gypsy-like elements are >4 × more abundant than copia-like elements. Other distinguishing characteristics of the LTR retrotransposon sequences identified in this study are presented in Table 1 (see Additional file 1), including clone accession numbers, chromosomal location, sequence length, target site repeats (TSRs), LTR pairwise identities and estimated element age. TSRs result from a duplication of the unoccupied insertion site following element insertion [35]. The TSRs of all rice LTR retrotransposons are five bp long except for Osr26.1, which has a TSR of seven bp. While members of O. sativa families share >90% RT sequence identity [32], sequence identity values among the more rapidly evolving LTRs are highly variable within families, ranging from 75.505% to 100%.

Phylogenetic Substructure of O. sativa LTR Retrotransposons
The slowly evolving RT encoding region of LTR retrotransposons is ideal for calculating evolutionary distances among even distantly related families of retroelements [6,36]. However, sequence analyses of the more rapidly evolving LTRs are better suited for the characterization of phylogenetic substructure within families of LTR retrotransposons. We used LTR sequence divergence among elements within each family to identify sub-structure within the O. sativa LTR retrotransposon (Osr) phylogeny.
The Osr26 family, for example, is comprised of 32 elements falling into at least 5 distinct clades (Fig. 1). The Osr27 family, which is composed of 134 elements, displayed complicated substructure, consisting of at least 10 divergent clades with strong bootstrap support (Fig. 2). The remaining 27 families are closely related and displayed no significant intra-family substructure.  The LTRs of one gypsy-like (Osr27.2) and one copia-like (Osr13.8) element displayed atypically high levels of sequence divergence indicating that these elements are exceptionally old or possibly that these elements are, in fact, hybrid elements generated by homologous recombination or some other recombination process. Indeed, such inter-element recombination events have been previously documented in yeast [41,42]. However, in this case, recombination is unlikely since the target site duplications of these (and indeed all full-length elements identified in this study) are identical. Thus, we have no direct evidence that any of the full-length rice LTR retrotransposons analyzed in this study were generated by recombination.

Distribution of Rice LTR-retrotransposons on Chromosome 10
At the time of this study, about 48% of rice genomic sequences were available in the public database, including the almost entirely sequenced chromosomes 1 and 10. In an initial effort to gain insight into the intrachromosomal distribution of O. sativa LTR retrotransposons, we selected Chromosome 10 for a more detailed analysis. Tests were conducted to determine whether LTR retrotransposon sequences were randomly distributed on Chromosome 10. We found that the average density of elements on Chromosome 10 is 22.321/Mb (Table 2). There is a nonrandom clustering of both copia-like and gypsy-like elements along the chromosome [...] proportionally more gypsy-like than copia-like elements on chromosome 10 (17 vs. 5 elements per Mb DNA). Comparing the total number of retrotransposons in the rice genome (Table 1) with the retrotransposons on Chromosome 10, we noticed that two families (Osr12 and Osr33) have more elements on Chromosome 10 than in the whole sequenced genome. This is not unexpected, because many sequences from Chromosome 10 were released after we stopped searching rice genomic sequences in GenBank on January 1, 2002 (see methods).

Figure 1. Phylogenetic trees of subfamily structure based on LTR nucleotide sequence data. The 32 elements of the Osr26 family fall into at least 5 clades, with Osr25 as the outgroup. To better exhibit tree structure, all Osr27 elements were removed. Insertions/deletions were ignored while performing phylogenetic analyses. Values on individual branches are bootstrap percentages using 1000 bootstrap repetitions. Each LTR in the tree is named by the genomic clone in which it was found. For elements with two LTRs, the 3' LTR is labeled by a lower case "b" while the 5' LTR is labeled by a lower case "a". Each tree is exhibited with a scale bar determined by the number of nucleotide substitutions per site between two sequences. The tight clustering seen in both families represents a high degree of nucleotide identity between elements within a subfamily.

Figure 2. Phylogenetic trees of subfamily structure based on LTR nucleotide sequence data. The Osr27 family forms at least 10 divergent clades with strong bootstrap support, with Osr41 as the outgroup. To better exhibit tree structure, all Osr27 elements were removed. Insertions/deletions were ignored while performing phylogenetic analyses. Values on individual branches are bootstrap percentages using 1000 bootstrap repetitions. Each LTR in the tree is named by the genomic clone in which it was found. For elements with two LTRs, the 3' LTR is labeled by a lower case "b" while the 5' LTR is labeled by a lower case "a". Each tree is exhibited with a scale bar determined by the number of nucleotide substitutions per site between two sequences. The tight clustering seen in both families represents a high degree of nucleotide identity between elements within a subfamily.
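The Chromosome 10 density and the test for nonrandom distribution can be illustrated with a minimal sketch. The text does not name the statistical test that was used, so the chi-square goodness-of-fit comparison against a uniform expectation shown here is an assumption, as are the function names, the equal-width binning and the default of ten bins. The chromosome length of 29,285,477 bp is the figure given in the methods.

```python
def density_per_mb(n_elements: int, sequence_length_bp: int) -> float:
    """Number of elements per megabase of sequence."""
    return n_elements / (sequence_length_bp / 1_000_000)


def chi_square_uniform(positions, sequence_length_bp, n_bins=10):
    """Chi-square statistic for element positions against a uniform model.

    The sequence is cut into n_bins equal-width bins; under a random
    (uniform) distribution each bin is expected to hold the same number
    of elements. Returns (statistic, degrees_of_freedom). The statistic
    should be compared with a chi-square critical value for that number
    of degrees of freedom.
    """
    if not positions:
        raise ValueError("at least one element position is required")
    bin_width = sequence_length_bp / n_bins
    observed = [0] * n_bins
    for pos in positions:
        # Clamp so that a position at the very end falls in the last bin.
        index = min(int(pos / bin_width), n_bins - 1)
        observed[index] += 1
    expected = len(positions) / n_bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, n_bins - 1
```

A large statistic relative to the critical value for `n_bins - 1` degrees of freedom would indicate clustering; evenly spread positions give a statistic near zero.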

O. sativa LTR retroelement sequences are associated with putative genes
In an initial effort to determine whether and how frequently LTR retrotransposon sequences are associated with putative genes in the rice genome, we examined elements from all 19 families of copia-like elements and 5 [...] followed by solo LTRs (3%) and full-length elements (2%). While these numbers are likely to change somewhat as the rice genome is better annotated, these preliminary estimates indicate that the potential contribution of LTR retrotransposon sequences to the evolution of gene structure and function in rice may be significant.

Figure 3. O. sativa copia-like LTR-retrotransposon age calculated using intraelement LTR nucleotide similarities. Only those elements that contain LTR nucleotide divergence values other than zero are included.

Conclusions
We have previously determined that two classes (gypsy-like and copia-like) of full-length LTR retrotransposons comprise ~17% of the Oryza sativa genome [32]. In this study, we have extended the earlier survey to include all identifiable fragments of LTR retrotransposon sequences. We have classified rice LTR retrotransposon sequences into three groups: full-length elements, solo LTRs, and fragmented elements. We have identified a total of 1219 LTR retrotransposon sequences in the region of the O. sativa genome analyzed in this study distributed over 38 families. Gypsy-like elements are >4 × more abundant than copia-like elements. Eleven of the thirty-eight investigated LTR-retrotransposon families display significant subfamily structure. We estimate that at least 46% of LTR-retrotransposons in the Oryza sativa genome are older than the age of the species (<680,000 years). A detailed examination of chromosome 10 revealed that LTR retrotransposon sequences are not randomly distributed across this chromosome but are more dense in the pericentric region. We found that approximately 20% of Oryza sativa LTR retrotransposon sequences lie within putative genes and thus may play a significant role in gene evolution.

Figure 7. The distribution of full-length, solo and fragmented LTR elements along chromosome 10. The borders of the pericentromeric regions of chromosome 10 were assigned as being 5 cM from the center of the centromere on each arm. The remainder of the chromosome was designated as arms. Both copia-like and gypsy-like LTR retrotransposons exhibit nonrandom clustering along the chromosome. More LTR retrotransposons reside in the pericentric regions of the chromosome than in the arms, and the short arm of Chromosome 10 also displays a higher density of elements than the long arm. Gypsy-like elements outnumber copia-like elements on chromosome 10.

Sequence identification and Retrieval
Whole genome analysis
LTRs representing previously identified families of rice LTR retrotransposons [32] were used as queries in BLASTN searches against rice genomic sequences present in GenBank as of January 1, 2002 (http://www.ncbi.nlm.nih.gov/). About 17% of rice genomic sequences were represented in this release. Subsequent sequence similarity searches against 29,285,477 bp of the complete chromosome 10 sequences in GenBank were performed up to April 25, 2002. Only the 38 retrotransposon families for which both LTRs have been identified [32] were included in our study. To be considered an LTR sequence in this study, a BLAST "hit" had to display ≥ 60% sequence homology to the LTR query sequence in a pair-wise comparison test [43] and have a size ≥ 40% of that of the LTR query sequence. Each LTR identified by these criteria was given the name of the Osr (O. sativa retrotransposon) to which it was most homologous.

Table: Insertion frequencies of full-length, solo and fragmented LTR elements along chromosome 10 (columns: copia-like, gypsy-like and total, for the whole chromosome).
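The acceptance criteria for BLAST hits described in this section can be sketched as a simple filter. The sketch assumes that both thresholds are lower bounds (at least 60% identity to the LTR query and at least 40% of its length) and that "most homologous" can be approximated by the highest bit score among accepted hits. The `BlastHit` record, its field names and the bit-score tie-break are hypothetical and are not taken from the original analysis.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical record for one BLASTN hit of an LTR query on a clone."""
    query_family: str        # Osr family of the LTR query, e.g. "Osr26"
    query_length: int        # length of the LTR query in bp
    percent_identity: float  # identity of the aligned region, in percent
    alignment_length: int    # length of the aligned region in bp
    bit_score: float


def passes_ltr_criteria(hit: BlastHit,
                        min_identity: float = 60.0,
                        min_length_fraction: float = 0.40) -> bool:
    """True if the hit meets both the identity and the size threshold."""
    long_enough = hit.alignment_length >= min_length_fraction * hit.query_length
    return hit.percent_identity >= min_identity and long_enough


def assign_family(hits_for_locus: Iterable[BlastHit]) -> Optional[str]:
    """Name a locus after the accepted query it matches best, if any."""
    accepted = [h for h in hits_for_locus if passes_ltr_criteria(h)]
    if not accepted:
        return None
    # Assumption: the best bit score stands in for "most homologous".
    return max(accepted, key=lambda h: h.bit_score).query_family
```

For a locus matched by several LTR queries, `assign_family` returns the family of the best accepted hit, or `None` when no hit meets both thresholds.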

Multiple Sequence Alignments and Phylogenetic Analyses
Using the clone coordinates from the BLAST searches, the rice LTR sequences were copied and placed into individual files. Alignments were created using ClustalW and edited with MacVector 7.0 http://www.gcg.com/. ClustalX 1.8 [44] was used to generate neighbor-joining (NJ) trees with bootstrap values and visualized with TreeView 1.5.3 [45].

Age determination of elements
Full-length elements were aged by comparing their 5' and 3' LTR sequences [37]. Kimura 2-parameter distances (K) between the 5' and 3' LTRs of individual elements were calculated by MEGA-2 [46]. The average substitution rate (r) of 6.5 × 10⁻⁹ substitutions per synonymous site per year for grasses [40] was used to calibrate the ages of the rice LTR retrotransposons. The time (T) since element insertion was estimated using the formula T = K/2r, where T = time of divergence, K = divergence, and r = substitution rate [47].
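The age calculation can be illustrated with a short sketch. The distances in this study were computed with MEGA-2; the function below is only a stand-in that applies the standard Kimura 2-parameter formula, K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], to a pre-aligned pair of LTRs, where P and Q are the proportions of transitions and transversions. The function names and the choice to skip gapped or ambiguous columns are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p_distance(ltr5: str, ltr3: str) -> float:
    """Kimura 2-parameter distance between two aligned LTR sequences.

    Columns containing anything other than A, C, G or T in either
    sequence (gaps, ambiguity codes) are skipped.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time_years(k: float, rate: float = 6.5e-9) -> float:
    """Time since insertion, T = K / 2r, with r in substitutions/site/year."""
    return k / (2 * rate)
```

As a worked example of the formula, a divergence of K = 0.013 between the two LTRs of an element corresponds to T = 0.013 / (2 × 6.5 × 10⁻⁹) = 1,000,000 years, and identical LTRs (K = 0) give an age of zero.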

Location of LTR-retrotransposon sequences on Chromosome 10
Rice LTR retrotransposons were used as queries in BLASTN searches against chromosome 10 sequences present in GenBank as of April 25, 2002. The distribution of LTR retrotransposon sequences on Chromosome 10 of Nipponbare was estimated based on the position of LTR retrotransposon sequences in PACs and BACs http://

Figure 10. Four cases of association between retroelements and O. sativa putative genes observed in this study. Red arrows indicate positions of LTR retroelements with the direction of transcription. Green bars represent NCBI database-predicted gene regions with their orientation of transcription. All four associations (from F to I) are located in the following genomic clones: AP003627, AC079685, AP003021 and AP003144. (F) An Osr34 solo LTR is associated with two putative genes nearby (P0459B04. 22 [...]

Authors' contributions
L.G. carried out all data collection, completed sequence analyses, and drafted the manuscript. J.M. participated in the design and coordination of the study and greatly contributed to the revision of the manuscript. E.M. provided all LTR sequences of rice retrotransposon families, and provided technical help with data collection in the earlier stage of this study. E.G. provided technical assistance during sequence collection and data analyses, and gave valuable comments to improve the manuscript. All authors read and approved the final manuscript.