A transcriptional sketch of a primary human breast cancer by 454 deep sequencing

Background The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts. Results We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in additional eight primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas. Conclusion Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling.


Results:
We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in additional eight primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas.

Conclusion:
Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling.

Background
The classic image of the mammalian transcriptome is composed of a large assembly of spliced mRNAs, each structured with a capped 5' end, a 5' untranslated region, a coding sequence, a 3' untranslated region and a polyA tail, together with a relatively well-defined set of non-protein-coding RNAs with different functions (ribosomal, transfer, spliceosomal and small nucleolar RNAs), with most of the genome thought to be genetically inert. Transcriptome sequencing and annotation initiatives have challenged this view by discovering that most of the genome is actively transcribed to yield complex patterns of interlaced and overlapping transcripts, including tens of thousands long (>200 nt) non-protein-coding RNAs (ncRNA) [1][2][3].
Non-coding RNAs (ncRNAs) have emerged as a diverse and important class of functional transcripts, accounting for approximately the 1.5% of the transcriptional output of mammalian genomes [4,5]. The regulatory role of these molecules has been clearly established for some species such as microRNAs (miRNAs) or small nucleolar RNAs (snoRNAs) [6,7]. In addition, although most have not yet been studied, many of the observed long 'mRNA-like' ncRNAs are differentially expressed and developmentally regulated, and increasing numbers are being shown to function in a range of processes in cell and developmental biology [8][9][10][11][12][13].
Compared to wild-type, the cancer cell transcriptome is grossly altered. Microarray studies have revealed a host of aberrations (i.e. drastic changes in expression levels of specific transcripts), and recent RNA-seq studies have identified a set of cancer-specific transcripts and transcriptional variants in tissues and cell lines [14][15][16][17]. Common alterations found in tumors are gene fusions and aberrant splicing isoforms [18,19]. Although prevalent in blood tumors, gene fusions occur in all malignancies, and they account for 20% of human cancer morbidity [20]. Alternative splicing is often deregulated in cancer, probably as a consequence of quantitative alterations in the levels of expression of splicing regulators [21]; however, many examples of cancer-specific gene isoforms (CD44, BRCA1, survivin etc), whose expression seem to correlate with the disease, have been described in literature [22].
A link between ncRNAs and cancer is becoming increasingly evident. For example, two ncRNAs, PCGEM and DD3, are significantly over expressed in prostate cancer, HULC expression is significantly associated with hepatocellular carcinoma [23] and MALAT1 is known to be over expressed in several human carcinomas [24][25][26]. Additionally, genes encoding hundreds of highly conserved ncRNAs are altered in a significant percentage of leukaemia and carcinomas [27].
To explore this complexity, we employed the Roche 454 deep sequencing technology [28] and biology-oriented sequence analysis techniques to obtain a transcriptional snapshot of a normalized primary breast cancer cDNA library. Our approach is largely qualitative, aiming at the identification of transcriptional events associated with the cancer phenotype. These included gene fusions, gene deletions, rare or aberrant transcriptional isoforms, ncRNAs, and transcripts of unknown function (TUF); a subset of interesting transcripts was validated using RT-PCR on the RNA obtained from the original breast cancer sample as well as from other eight carcinomas with the same histotype. Globally, our results demonstrate that direct pyrosequencing of a normalized human cDNA library coupled with bioinformatic analysis complements quantitative investigations of gene expression by providing an accurate qualitative picture of a complex transcriptome, potentially unraveling tissue or disease-specific transcriptional events.

cDNA library preparation, emulsion PCR and pyrosequencing
Polyadenylated RNA was isolated from a breast invasive tumor sample (in situ lobular carcinoma, bilateral, with elevated mitotic and proliferative index, G3, Tamoxifen treated, identified by the code 1360), having a purity of 85-90%. cDNA was synthesized using Super SMART™ PCR cDNA Synthesis Kit (Clontech, Mountain View, CA). Prior informed consent for the research use of biological material from surgery was obtained for this sample. The ethics committee of the Institute for Biomedical Technologies -National Research Council approved the use of this biological sample for the study presented here. After reverse transcription, the cDNA library was normalized to obtain an equilibrated mix of low and high abundance mRNAs using Kamchatka crab double-strand nuclease (DSN) [29], as described in Additional file 1.
2.1 μg of normalized double stranded cDNA was sheared by nitrogen nebulization following the manufacturer's instruction (Roche, Basel, Switzerland). Ligation of the nebulized sample to specific adaptors and preparation of the single strand libraries (sstDNA) was performed as previously described [28]. After purification, nebulized sstDNA preparation was quantitated by RiboGreen RNA Quantitation Kit (Invitrogen Inc., Carlsbad, California). Quality was assessed using an Agilent Bioanalyzer. All purification steps were performed using MinElute PCR Purification Kit (Qiagen, Hilden, Germany).
The sstDNA library was then amplified by emulsion PCR performed in water-in-oil microvescicles. Each PCR reaction was recovered by propanol emulsion breaking and buffer washing and enriched for positive reaction beads. The beads were then washed; the primers were annealed and then counted using the Multisizer™ 3 Coulter Counter (Beckman Coulter, Inc. Fullerton, CA, USA). The kits for DNA fragmentation, polishing, capture on beads, emulsion PCR and sequencing were purchased from Roche Diagnostics. Samples were loaded onto 70x75 PicoTiter-Plate (PTP) and inserted in the 454 -Roche GS 20 Genome Sequencer for the pyrosequencing reaction.

Sequence redundancy reduction
Sequence reads were extracted from the raw pyrosequencing data following the manufacturer's technical documentation. The technical redundancy in the dataset (perfect sequence duplication) was removed using the NCBI nrdb program, included in the downloadable Blast suite http:/ /www.ncbi.nlm.nih.gov/BLAST/download.shtml. After mapping the remaining reads to the genome, we employed a second sequence redundancy reduction step for the analyses investigating the overlap between our reads and genomic features such as ENCODE regions, ncRNAs or genes. For this purpose, we used the CleanUp Algorithm [30] to generate a new non-redundant dataset, using stringent cut-off parameters (similarity > 98%, coverage threshold > 98%). We used the Cap3 assembler [31] to perform all the transcript assemblies.

Mapping to the transcriptome and genome
A detailed description of the bioinformatics methods used in this part of the work can be found in Additional file 1. All the database searches against known transcripts (such as ESTs) were performed using the NCBI BlastN program. Non-redundant sequence reads were compared with the human genome using Blat [32]. All human full-length transcripts annotated in UCSC database (all_mrna Table, all Human mRNAs from GenBank, human genome release hg18, March 2006) [33] were used as reference set for the classification. We defined a read as 'spliced' when mapping to a chromosome with a coverage > = 95% in at least two parts separated by a gap > = 50 nt. We classified a read as 'intragenic' when mapping at least partially within a known gene (either in an exonic or intronic region), otherwise it was classified as 'intergenic'. Additional criteria were used to build an 'exon-oriented' read classification.
A collection of Conserved Sequence Tags (CSTs) [34,35], obtained by a full-genome comparison of human and mouse genomes, was compared to the genome mappings of the cDNA reads, excluding reads located within known exons, to evaluate both conservation and coding propensity.

Bioinformatic identification of cancer-specific splice sites and fusion/deletion transcripts
The details of the bioinformatics strategy used for detection of gene fusions and deletions is described in detail in Additional file 1. Briefly, we first detected alignments (using reads at least 50-bp long) corresponding to putative chromosomal rearrangements and then identified putative translocation-mediated interchromosomal fusion transcripts by comparing the gene direction at the predicted breakpoints with known exon boundaries. Using a similar procedure, we identified intragenic deletion events. Predictions were compared with data from the chimerDB database [18].
To analyze cancer-associated splicing events we used the ASAP II database [19], which catalogues validated cancerassociated isoforms curated from EST sequencing data. We identified deep-sequencing reads with high-quality alignments and at least one splice site, and compared them with 273 high-confidence cancer-specific splice sites (LOD > = 3) from 198 genes in ASAP II database.

Analysis of non-protein coding transcripts
The breast cancer cDNA library reads were aligned to UCSC Known Genes FastA sequences (human genome release hg18, 260.731 entries) using BLAST, and were classified on the basis of their genomic location. The conservation profile of non-exonic reads was assessed using the UCSC PhastCons17way conservation score. A total of four different datasets were generated: intronic, extragenic, desert conserved and desert non-conserved. These datasets were subsequently cross referenced against CRITICA ncRNA predictions [12], a subset of RNAdb [36], and NONCODE [37]. Details of these bioinformatic analyses are available in Additional file 1. In addition, we assessed the overlap between cDNA reads and the ENCODE project annotation of novel transcribed region of unknown function [38] by intersecting high-quality genome-wide mappings with the genomic coordinates of the encodeRna Table at UCSC.

Biological validation of selected transcripts
Validation was performed by direct sequencing of the cDNA library and RT-PCR. We used RNA obtained from the original lobular breast cancer sample and from other eight tumors and performed RT-PCR using an oligo (dT) primer and SuperScriptTM II Reverse Transcriptase (Invitrogen Inc., Carlsbad, California) according to manufacturer's instructions. For fusion transcripts we sequenced individual PCR products after cloning them into the pCR ® II-TOPO TA vector (Invitrogen Inc., Carlsbad, California). Additional file five lists all the PCR primers and their annealing temperatures, together with the results of all validations experiments. Since we were investigating rare transcripts detected from a normalized cDNA library, we reasoned that RNA extracted directly from primary samples could be the best source of genetic material for validation.

Assessment of the cDNA library normalization before and after deep sequencing
Aiming to detect rarely expressed transcripts, we complemented the standard deep-sequencing protocol with a normalization step. Reference genes, which are often referred to as 'housekeeping genes', are frequently used to normalize mRNA levels between different samples [39]. In order to assess the success of the cDNA library normalization procedure before sequencing, PCR amplifications with selected probes corresponding to reference genes were performed. Three reference genes with different expression levels were chosen for the analysis (Additional file 1). A visual inspection of the amplification bands in Figure 1 confirms that the normalization procedure decreased the level of highly expressed transcripts and increased the strength of the bands corresponding to lowlevel transcripts. For example, the expression of GAPDH (Glyceraldeide-3-phosphate dehydrogenase) is reduced Assessment of the cDNA library normalization before sequencing Figure 1 Assessment of the cDNA library normalization before sequencing. RT-PCR amplification of three reference genes, each expressed constitutively but with different abundance in the cell, to test the normalization of the cDNA library before sequencing. X = control before normalization. N = normalized library. and the expression of weakly expressed HPRT1 (Hypoxanthine phosphoribosyltransferase 1) is increased.
To assess whether the library normalization was also reflected in the 454 sequence output, we counted the reads that could be unequivocally associated with ACTB (Beta-actin), GAPDH and HPRT and compared them with the abundance of these transcripts (from the same tissue and pathological state as our experimental sample) in a public EST library (see related Methods in Additional file 1). We applied a well-established statistical test for assessing significant differences in digital gene expression profiles [40] and found that that the sequence sampling reflects the normalization of the cDNA library, even at the relatively shallow depth of sequencing accomplished with 454 ( Table 1).

Statistics of the sequencing results
We used the NCBI nrdb software to filter out technical redundancy from the sequence dataset. We obtained 251,262 non-redundant sequence reads, fitting approximately a normal distribution with a median length of 88 nt and the third quartile at 102 nt ( Figure 2). After mapping the non-redundant reads to the genome (requiring a minimum coverage of 70%), we obtained a second dataset of 194,806 distinct sequences which excluded all reads with uncertain mapping. A threshold of 98% identity, 98% coverage and a single match on the genome was then used for comparisons with annotated transcripts, gene structures, highly conserved genomic regions, ENCODE regions, and ncRNAs, resulting in 132,113 reads ( Table  2). This dataset was used for all the other statistics in this section and will be referred to as the 98.98.1 dataset. The 98.98.1 dataset has been deposited at the EMBL Nucleotide Sequence Database as EST sequences with the Accession Numbers FN045784 to FN177896.
The aim of cDNA nebulization was to maximize the sampling of sequence length. In order to evaluate the effective coverage of full-length transcripts obtained with our protocol, we counted all the high-quality  (Figure 3). This is not an unexpected finding, since RT-PCR followed by cDNA synthesis is necessarily biased toward the 3' end of the transcript, unless controlled partial hydrolysis of RNA is performed before retrotranscription [41].
One effective way of exploring molecular diversity by sequencing is through analysis of mRNA 3'UTRs, which are rich in single-feature polymorphisms that distinguish closely related transcripts. The specificity of 3'UTR sequences allows effective annotation of individual mRNAs without assembly of complete cDNAs and can be useful in transcriptome profiling by sequencing [42]. However, caution should be used in data interpretation, as there is some evidence that 3'UTRs may be separately expressed (Wilhelm, Soldà, Mercer, Dinger, Simons, Glazov, Koopman and Mattick, unpublished data). We used the non-redundant dataset of human 3'UTRs (39,758 sequences) from UTRdb, a curated database of 5' and 3' untranslated sequences of eukaryotic mRNAs [43], as a target for the 98.98.1 sequence reads dataset, requiring perfect identity and coverage of at least 90% to accept a match. From a total of 18,262 matches to the UTRdb we obtained 9,178 reads which could be univocally associated with a single RefSeq transcript (~50% of the matches). We conclude that the 454 reads mapping with high quality on a transcriptome have a high 'resolution power', or ability to distinguish between transcript variants

Genomic classification of sequence reads
In order to characterize and annotate the breast cancer transcriptome, we mapped each read on the human genome and extracted all features associated with that target region. We then employed a hierarchical classification based on multiple criteria; the results are summarized in Table 3 and detailed in Additional file 2.
The first clustering divided all the genome matching reads in two large datasets: 'spliced' and 'unspliced' reads (see Methods). The 'unspliced' dataset was split into intragenic or intergenic. Intragenic reads were then assigned to 4 different classes: exon, intron, extended 5' and extended 3'. The 'spliced' dataset was also classified by location within a gene. In order to detect potentially novel transcriptional features we excluded the entire unspliced-exon dataset from further analyses, as this will mostly contain wellknown entities. We noticed that there are a significant number of matches in the intragenic non-exonic portion of genes, which we attribute to new exons, retained introns or intronic transcripts.
Genome-wide identification of coding and non-coding Conserved Sequence Tags (CST) in human and mouse genomes [34,35] provides a dataset of genome coordinates which can be correlated with our deep sequencing reads, especially those associated with putative novel transcripts. The distribution of the CSTs, divided in four categories (undefined; non-coding; coding; ultraconserved), is reported in Figure 4 and shows the normalized ratio between the numbers of the Conserved Sequence Tags in each category with the corresponding number of cDNA Distribution and statistics of the cDNA reads reads. The categories of the cDNA reads are a function of the number of high-quality matches for each of the reads: one single match on the genome; from 2 to 10 and more than 10. The sequences which match only once on the genome show an equal distribution between overlapping coding and non-coding CSTs, while sequences with multiple matches tend to overlap conserved regions with high coding potential.

Identification and primary validation of potential cancerassociated transcriptional events
Detailed bioinformatic analyses were performed on our breast cancer library to identify fusion transcripts, aberrant or novel splicing isoforms, as well as known cancerrelated splice variants (see Methods and Additional file 1). In total, we found 477 putative rearrangement events. It must be noted, however, that we expected the rate of false positives to be high, due to sequencing or PCR artifacts. A manually curated selected dataset, including only reads containing at least one end in proximity of a splice site, identified six putative translocation-mediated fusion events and two intragenic deletions; the relative sequences in FastA format are available from Additional File 3 and the genome mapping and annotation are in Additional file 4.
Fusion transcripts can be derived from either trans-splicing of separate pre-mRNA molecules [44] or from transcription of rearranged chromosomal regions in which sections of two separate chromosomes have been joined by translocation, deletion, or inversion [16,17]. A potentially interesting fusion event was detected from the sequence 107781_1044_1738 (115 nt long), which we renamed 4A, involving two genes located on different chromosomes: UBR4 (ubiquitin protein ligase E3 component n-recognin 4) on chromosome 1 and GLB1 (betagalactosidase-like protein) on chromosome 3. UBR4, commonly known as p600 or retinoblastoma proteinassociated factor 600, is a cellular target of the human papillomavirus type 16 E7 oncoprotein, contributing to anchorage-independent growth and cellular transformation. UBR4-E7 interaction strongly contributes to cellular transformation [45]. The GLB1 gene encodes beta-galactosidase-1 (EC 3.2.1.23), a lysosomal hydrolase that cleaves the terminal beta-galactose from ganglioside substrates and other glycoconjugates. The predicted fusion, verified by direct sequencing of the original cDNA library, links exon 16 of the gene UBR4 with the terminal exon (composed of coding and 3'UTR sequence) of the GLB1 gene. Our sequence is colinear with both transcripts and exon-exon junctions are clear in the hybrid sequence. The  predicted final processed fusion cDNA UBR4/GLB1 would be 14,022-bp long and would produce a large protein of 4,526 residues, which is shorter than the original UBR4 protein (5,183 residues). The UBR4/GLB1 fusion protein is identical to UBR4 up to residue 4,433, and then diverges significantly ( Figure 5). Fusion transcripts are common in haematological malignancies and they are also recognized as contributing to the pathogenesis of solid tumors [20]. Our sequencing approach thus allowed the reliable detection of one novel cancer-related gene fusion which we subsequently validated at the transcript level in the original sample, although we could not replicate clearly this finding in the other eight breast cancer RNAs examined (Additional file 5). Recent papers, which appeared while this work was under review, highlighted that a particular cancer cell line or tissue can harbour multiple gene fusions, many of which are likely not recurrent, in accordance with our results [16,17]. Further refinements to the current protocol, such as paired-end sequencing or coupling of long with short reads, will be more effective in the specific identification of fusion events. Indeed, a recent study using Paired-End diTags Identification of a UBR4/GBL1 fusion transcript (PETs) in cancer cell genomes detected 70 new fusion transcripts [46].
An example of a transcriptional or genomic deletion event is provided by sequence read 1B (167378_1645_3303), located on chromosome 8. We interpret this sequence as a deletion, probably due to a loop that causes the inclusion of exons 2 and 7 of the WHSC1L1 gene in an inverted order in the mature transcript ( Figure 6). This transcript was also confirmed by direct sequencing of the cDNA library and by RT-PCR in an additional breast cancer sample (Additional file 5). The WHSC1L1 gene is related to the Wolf-Hirschhorn syndrome candidate-1 gene and encodes a protein with PWWP (proline-tryptophan-tryptophan-proline) domains. Two alternatively spliced WHSC1L1 variants have been described. The long isoform contains a PHD-finger domain (an interleaved type of Zn finger chelating 2 Zn ions) and a SET domain (proteinprotein interaction domain); however, the function of the protein has not been determined yet and hence the relevance to cancer aetiology of this deletion is uncertain.
The median length of the sequence reads in this sequencing run was around 85 nt, which is considerably longer than what can be achieved with other deep sequencing technologies, although much shorter than the length achievable with the latest generation of 454 sequencers (FLX and Titanium). This read length made it possible to identify interesting transcripts (ncRNAs; novel isoforms; fusions; deletions) without any sophisticated molecular 'a priori' processing of the RNA, apart from library normalization. This approach may be particularly useful to researchers investigating the transcriptome of species with poor genome annotation.
Taking advantage the length of 454 sequencing reads we investigated alternative splicing events, which are known to be significantly altered in breast tumors and can be used to identify normal and cancerous breast tumor tissue [47]. We identified a 102 nt read, 045624_1590_1179 (which we renamed 6B), which corresponds to a known isoform spanning the Variable, Light and Join segments of Unigene cluster 449585, Immunoglobulin lambda joining 3, on chromosome 22q11.1-q11.2. When mapped to the genome, this read aligns to three different exons, the second and the third being separated by ~511,000 bases (Additional file 4). This rare isoform was validated both by direct sequencing of the cDNA library and by RT-PCR of the original RNA, but was not detected in any other breast cancer sample tested (Additional file 5). We also identified splice isoforms that had not previously been reported. The 119 nt read 023612_1809_1670, renamed AI4, spans a splice junction of the gene BC031316 and is supported by three spliced ESTs but by no full-length mRNA. This isoform was confirmed by direct sequencing on the original cDNA library ( Figure 7) and was detected by RT-PCR in multiple additional breast cancer samples (Additional file 5).
Specific cancer-associated transcript isoforms have been predicted based on differential representation of ESTs between tumor and healthy tissues. When we queried the ASAPII database of alternative splicing with our deep sequencing reads we identified five putative cancer-associated isoforms, one of which has a plausible role in breast cancer biology. Isoform b of HIGD1A (HIG1 domain family, member 1A) gene ( Figure 8) encodes a protein containing a domain likely to be involved in the hypoxia response. Growth and progression of breast cancers are accompanied by increased neovascularization (angiogenesis) and a variety of factors, including hypoxia, contribute to increased production of angiogenic factors [48]. The HIGD1A variant (isoform b) identified by deep sequencing contains a distinct 5'UTR and lacks a portion of the 5' coding region, compared to variant a. These dif-Identification of a deletion within the WHSC1L1 gene ferences cause translation initiation at a downstream AUG and a protein with a shorter N-terminus. The predicted association with cancer is strongly supported by the ASAPII LOD score for this transcript [49]. Our data suggest that specific isoforms of HIGD1A are expressed in breast cancer and may be associated with the disease state.

Experimental validation of selected transcript variants identified with bioinformatic analysis
A subset of the above-described transcripts, including some potential cancer-associated variants, were selected for biological validation by RT-PCR on RNA derived from the same tissue sample that was used for deep sequencing, as well as from other eight lobular breast cancers. Amplification products from the original sample were assessed by Sanger sequencing. We obtained experimental confirmation of one deletion, one novel intergenic transcript, one novel intragenic transcript, nine previously unannotated isoforms -two associated with exon skipping and seven with novel splice site usage -and one known, but rare isoform. A gene fusion event which we detected by deep sequencing could be validated only in the original sample. These results, the primer sequences, the transcript labels and the PCR product sizes are described in Additional file 5.

Detection of known and novel non-coding RNAs and of putative new transcriptional units
A number of approaches are currently available to discriminate protein-coding transcripts from ncRNAs [50].
Identification of a novel isoform of the BC031316 gene The most reliable way of evaluating the percentage of sequences that could represent real matches with ncRNAs is a direct sequence comparison with stringent criteria and an appropriate error model for 454 sequencing reads [51] (see Methods). In order to examine the expression of ncRNAs in our breast cancer sample, we first screened known ncRNAs selected from two independent non-coding RNA databases: RNAdb and NONCODE [36,37]. In total, 98 sequences corresponding to known ncRNAs were detected in our library (Table 4 and Additional file 6). Interestingly, some of them such as SRA1 and MALAT1, have previously been associated with other tumor types and might also play a role in breast cancer pathogenesis. The steroid receptor RNA activator (SRA) is a unique modulator of steroid receptor transcriptional activity that functions as a regulatory RNA assembled in a ribonucleoprotein complex. Recent findings, however, have shown that the SRA1 locus can produce both pro-Identification of a putative cancer-associated isoform Figure 8 Identification of a putative cancer-associated isoform. UCSC genome screenshot showing the region where the 176265_1466_0318 read maps, clearly identifying the shorter protein isoform (1b) of HIGD1A, which is predicted to be significantly associated with the 'cancer' phenotype according to ASAPII EST and transcript analysis. The black block identified as 27858 is the ASAPII intron associated with the cancer-specific HIGD1A isoform 1b splice site. tein-coding and non-coding transcripts which are involved in the regulation of estrogen and androgen receptor signaling pathways. Moreover, several reports have shown increased SRA expression in breast, uterus and ovarian cancers, and a possible direct involvement of SRA transcript in the mechanisms underlying breast tumorigenesis and tumor progression has been proposed [52].
The most abundant ncRNA we detected was MALAT1 (metastasis associated lung adenocarcinoma transcript 1). MALAT1 is a conserved 8-kb ncRNA whose expression correlates with the risk of developing metastasis in nonsmall-cell lung cancer (NSCLC) patients. Recent studies have also reported the overexpression of MALAT1 in uterine endometrial stromal sarcoma and hepatocellular carcinoma [25,26]. We found 309 reads mapping along this regulatory ncRNAs, which, when assembled with the cap3 program [31], gave rise to 14 contigs (sequences available in Additional file 3) distributed along all the length of the ncRNA ( Figure 9). Interestingly, only 6 of the 309 reads map to portion of MALAT1 which is cleaved in the nucleus and generates a cytoplasmic tRNA molecule [53]. This observation suggests that MALAT1 is abundantly expressed in our primary sample, in accordance with previous results [26], and prompted us to further investigate this finding.
We found that MALAT1 is abundantly expressed in all the publicly available annotated breast cancer samples retrieved from the CleanEx database http:// www.cleanex.isb-sib.ch/ [54]. Detailed analysis of an Affymetrix ER+ Tamoxifen-treated and untreated breast cancer data set [55] showed a relevant variation in MALAT1 transcript abundance, including a few outliers with very high expression of this ncRNA. Further analysis of cDNA microarrays, probed with total polyA+ RNA from ER+ lobular and ductal breast cancers treated with Tamoxifen, showed the same expression patterns [56]. This reinforces our finding that high MALAT1 expression may be episodically associated with single breast tumors and that the sensitivity of our deep sequencing approach facilitated the detection of this ncRNA in our sample. We also noticed that the Coefficient of Variation of gene expression values was higher in Tamoxifen-treated versus Tamoxifen-untreated breast cancer samples (84% versus 43% for the Affymetrix array experiments dataset) (Additional file 5).
Surprisingly, we found 23 reads corresponding to PIWIinteracting RNAs (piRNAs), which are thought to be selectively expressed in male and female gonads and are important for the control of transposable elements during germline development [7]. However, piRNA expression in Abundant representation of MALAT1 ncRNA in 454 sequences from the breast tumor sample  breast cancer is not totally unexpected. A mechanism of piRNA biogenesis that is not confined to the germline has recently been described [57].
Sequences that mapped to 'gene deserts', which lie at least 2 kb from the boundaries of any known transcript, and which resulted highly conserved according to UCSC PhastCons17, were manually examined for overlap with ESTs, transcript predictions and non-coding RNAs annotations derived from CRITICA [2,8,58]. Eleven per cent (684 of 5,950) of the reads that mapped to 'gene deserts' actually overlap CRITICA-predicted non-coding transcripts or are supported by EST data. These reads likely represent non-protein-coding genes (Additional file 6 and Figure 10), or exons at a significant distance (greater than 5 kb) from a gene, belonging to exceptionally extended 3'UTRs. In agreement with our data, evidence of extended 3'UTR has been recently reported in a deep-sequencing screen of the mouse transcriptome [41]. We also compared the reads mapping at least 5 kb from any known transcript (1,069 sequences) with a collection of sequences that are highly conserved between human and mouse and which are classified in order of coding potential (CSTs) [35]. We were able to identify 314 reads overlapping one or more CSTs with coding potential, and a further 351 reads overlapping non-coding CSTs. These results suggest that our deep sequencing and bioinformatics protocols are capable of detecting rare and novel transcripts outside known gene structures (Additional file 7).
The ENCODE annotation of transcriptionally active regions [38] (Transcripts of Unknown Function: TUFs) covers only 1% of the genome. However, we found 135 reads that overlap with 60 distinct ENCODE TUFs. We identified individual TUFs with both single and multiple reads, confirming our protocol's efficacy in enriching for non-canonical transcripts (Additional file 8).

Functional annotation of the coding part of the sequenced transcriptome
Functional annotation of transcripts has become an important aspect of microarray studies, and many tools are now available to assess gene expression biases [59].
Using the functional annotation strategies that are usually applied to microarray experiments functional annotation, we examined the genes identified by deep sequencing. 454 reads mapped to 6.067 RefSeq transcripts (Additional file 9) with counts per transcript ranging from 82 to 1, with median of 2. AKAP9, which interacts with multiple signal transduction pathways, had the highest number of counts, followed closely by the ncRNAs MALAT1 and XIST. Among transcripts with very few counts we identified a number of annotated pseudogenes.
We applied the DAVID on-line analytical tool [60] to identify enrichment of specific Gene Ontology (GO) terms among the genes which had at least two cDNA read counts (3,589 genes). The most enriched categories are related with protein and nucleic acid binding and with catalysis, as expected in an actively proliferating tissue ( Figure 11).

Conclusion
Quantitative transcriptional analysis of all the genes expressed by breast tumors has provided the first steps towards defining a molecular signature for the disease, and might ultimately make conventional diagnostic techniques obsolete. The qualitative analysis of the breast can-Identification of a novel conserved noncoding transcript Figure 10 Identification of a novel conserved noncoding transcript. UCSC human genome 8 kb screenshot of a "gene desert region" (no known gene in a 50 kb boundary) on the X chromosome tagged as transcribed by the sequencing read 002770_3171_2414. The read overlaps a CRITICA-predicted putative noncoding transcript (CR621898) and points to a new, highly conserved transcriptional island, according to a vertebrate 28 multi-species alignment and PhastCons conservation score.
cer transcriptome -such as the one obtained by massive cDNA sequencing and presented here -should instead contribute different and complementary information: the identification of novel possible pathogenic determinants (gene fusions and genome deletions) or biomarkers (aberrant or novel transcripts and isoforms, intronic and extragenic ncRNAs, expressed pseudogenes).
We demonstrated in this work that 454 deep sequencing of a normalized cDNA library, coupled with detailed biology-oriented bioinformatic analyses, has the potential to identify transcripts that may further our understanding of the breast cancer transcriptome, even starting from a relatively small number of sequences. In our primary breast cancer cDNA library and in a number of additional samples with a matching histotype, we have identified and validated several unusual transcriptional events that could be suitable for subsequent functional studies: gene fusions, gene deletions, novel or cancer-associated isoforms and putative novel ncRNAs.
We have also identified from our sequences a very high expression of the cancer-associated MALAT1 ncRNA and we replicated this observation in two different gene expression profiling experiments of well-annotated ER+ breast cancer patient cohorts, finding also an high variance between Tamoxifen treated and untreated patient samples. Although further technical refinements, such as controlled hydrolysis of RNA samples before cDNA synthesis and paired-end or di-tags sequencing, can increase significantly the number and diversity of sequences which can be annotated, our protocol has proved to be very effective in detecting rare or novel transcriptional events. Based on the results presented here, we are confident that further deep sequencing experiments and a similar bioinformatic analysis strategy will yield an even more comprehensive and detailed picture of the breast cancer transcriptome.