A transcriptional sketch of a primary human breast cancer by 454 deep sequencing
- Alessandro Guffanti†1, 2Email author,
- Michele Iacono†1,
- Paride Pelucchi†1,
- Namshin Kim3, 4,
- Giulia Soldà5,
- Larry J Croft6,
- Ryan J Taft6,
- Ermanno Rizzi1,
- Marjan Askarian-Amiri6,
- Raoul J Bonnal1,
- Maurizio Callari7,
- Flavio Mignone8,
- Graziano Pesole1, 9,
- Giovanni Bertalot10, 11,
- Luigi Rossi Bernardi12,
- Alberto Albertini1,
- Christopher Lee3,
- John S Mattick6,
- Ileana Zucchi1 and
- Gianluca De Bellis1
© Guffanti et al; licensee BioMed Central Ltd. 2009
Received: 28 August 2008
Accepted: 20 April 2009
Published: 20 April 2009
The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts.
We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in additional eight primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas.
Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling.
The classic image of the mammalian transcriptome is composed of a large assembly of spliced mRNAs, each structured with a capped 5' end, a 5' untranslated region, a coding sequence, a 3' untranslated region and a polyA tail, together with a relatively well-defined set of non-protein-coding RNAs with different functions (ribosomal, transfer, spliceosomal and small nucleolar RNAs), with most of the genome thought to be genetically inert. Transcriptome sequencing and annotation initiatives have challenged this view by discovering that most of the genome is actively transcribed to yield complex patterns of interlaced and overlapping transcripts, including tens of thousands long (>200 nt) non-protein-coding RNAs (ncRNA) [1–3].
Non-coding RNAs (ncRNAs) have emerged as a diverse and important class of functional transcripts, accounting for approximately the 1.5% of the transcriptional output of mammalian genomes [4, 5]. The regulatory role of these molecules has been clearly established for some species such as microRNAs (miRNAs) or small nucleolar RNAs (snoRNAs) [6, 7]. In addition, although most have not yet been studied, many of the observed long 'mRNA-like' ncRNAs are differentially expressed and developmentally regulated, and increasing numbers are being shown to function in a range of processes in cell and developmental biology [8–13].
Compared to wild-type, the cancer cell transcriptome is grossly altered. Microarray studies have revealed a host of aberrations (i.e. drastic changes in expression levels of specific transcripts), and recent RNA-seq studies have identified a set of cancer-specific transcripts and transcriptional variants in tissues and cell lines [14–17]. Common alterations found in tumors are gene fusions and aberrant splicing isoforms [18, 19]. Although prevalent in blood tumors, gene fusions occur in all malignancies, and they account for 20% of human cancer morbidity . Alternative splicing is often deregulated in cancer, probably as a consequence of quantitative alterations in the levels of expression of splicing regulators ; however, many examples of cancer-specific gene isoforms (CD44, BRCA1, survivin etc), whose expression seem to correlate with the disease, have been described in literature .
A link between ncRNAs and cancer is becoming increasingly evident. For example, two ncRNAs, PCGEM and DD3, are significantly over expressed in prostate cancer, HULC expression is significantly associated with hepatocellular carcinoma  and MALAT1 is known to be over expressed in several human carcinomas [24–26]. Additionally, genes encoding hundreds of highly conserved ncRNAs are altered in a significant percentage of leukaemia and carcinomas .
To explore this complexity, we employed the Roche 454 deep sequencing technology  and biology-oriented sequence analysis techniques to obtain a transcriptional snapshot of a normalized primary breast cancer cDNA library. Our approach is largely qualitative, aiming at the identification of transcriptional events associated with the cancer phenotype. These included gene fusions, gene deletions, rare or aberrant transcriptional isoforms, ncRNAs, and transcripts of unknown function (TUF); a subset of interesting transcripts was validated using RT-PCR on the RNA obtained from the original breast cancer sample as well as from other eight carcinomas with the same histotype. Globally, our results demonstrate that direct pyrosequencing of a normalized human cDNA library coupled with bioinformatic analysis complements quantitative investigations of gene expression by providing an accurate qualitative picture of a complex transcriptome, potentially unraveling tissue or disease-specific transcriptional events.
cDNA library preparation, emulsion PCR and pyrosequencing
Polyadenylated RNA was isolated from a breast invasive tumor sample (in situ lobular carcinoma, bilateral, with elevated mitotic and proliferative index, G3, Tamoxifen treated, identified by the code 1360), having a purity of 85–90%. cDNA was synthesized using Super SMART™ PCR cDNA Synthesis Kit (Clontech, Mountain View, CA). Prior informed consent for the research use of biological material from surgery was obtained for this sample. The ethics committee of the Institute for Biomedical Technologies – National Research Council approved the use of this biological sample for the study presented here. After reverse transcription, the cDNA library was normalized to obtain an equilibrated mix of low and high abundance mRNAs using Kamchatka crab double-strand nuclease (DSN) , as described in Additional file 1.
2.1 μg of normalized double stranded cDNA was sheared by nitrogen nebulization following the manufacturer's instruction (Roche, Basel, Switzerland). Ligation of the nebulized sample to specific adaptors and preparation of the single strand libraries (sstDNA) was performed as previously described . After purification, nebulized sstDNA preparation was quantitated by RiboGreen RNA Quantitation Kit (Invitrogen Inc., Carlsbad, California). Quality was assessed using an Agilent Bioanalyzer. All purification steps were performed using MinElute PCR Purification Kit (Qiagen, Hilden, Germany).
The sstDNA library was then amplified by emulsion PCR performed in water-in-oil microvescicles. Each PCR reaction was recovered by propanol emulsion breaking and buffer washing and enriched for positive reaction beads. The beads were then washed; the primers were annealed and then counted using the Multisizer™ 3 Coulter Counter (Beckman Coulter, Inc. Fullerton, CA, USA). The kits for DNA fragmentation, polishing, capture on beads, emulsion PCR and sequencing were purchased from Roche Diagnostics. Samples were loaded onto 70x75 PicoTiterPlate (PTP) and inserted in the 454 – Roche GS 20 Genome Sequencer for the pyrosequencing reaction.
Sequence redundancy reduction
Sequence reads were extracted from the raw pyrosequencing data following the manufacturer's technical documentation. The technical redundancy in the dataset (perfect sequence duplication) was removed using the NCBI nrdb program, included in the downloadable Blast suite http://www.ncbi.nlm.nih.gov/BLAST/download.shtml. After mapping the remaining reads to the genome, we employed a second sequence redundancy reduction step for the analyses investigating the overlap between our reads and genomic features such as ENCODE regions, ncRNAs or genes. For this purpose, we used the CleanUp Algorithm  to generate a new non-redundant dataset, using stringent cut-off parameters (similarity > 98%, coverage threshold > 98%). We used the Cap3 assembler  to perform all the transcript assemblies.
Mapping to the transcriptome and genome
A detailed description of the bioinformatics methods used in this part of the work can be found in Additional file 1. All the database searches against known transcripts (such as ESTs) were performed using the NCBI BlastN program. Non-redundant sequence reads were compared with the human genome using Blat . All human full-length transcripts annotated in UCSC database (all_mrna Table, all Human mRNAs from GenBank, human genome release hg18, March 2006)  were used as reference set for the classification. We defined a read as 'spliced' when mapping to a chromosome with a coverage > = 95% in at least two parts separated by a gap > = 50 nt. We classified a read as 'intragenic' when mapping at least partially within a known gene (either in an exonic or intronic region), otherwise it was classified as 'intergenic'. Additional criteria were used to build an 'exon-oriented' read classification.
A collection of Conserved Sequence Tags (CSTs) [34, 35], obtained by a full-genome comparison of human and mouse genomes, was compared to the genome mappings of the cDNA reads, excluding reads located within known exons, to evaluate both conservation and coding propensity.
Bioinformatic identification of cancer-specific splice sites and fusion/deletion transcripts
The details of the bioinformatics strategy used for detection of gene fusions and deletions is described in detail in Additional file 1. Briefly, we first detected alignments (using reads at least 50-bp long) corresponding to putative chromosomal rearrangements and then identified putative translocation-mediated interchromosomal fusion transcripts by comparing the gene direction at the predicted breakpoints with known exon boundaries. Using a similar procedure, we identified intragenic deletion events. Predictions were compared with data from the chimerDB database .
To analyze cancer-associated splicing events we used the ASAP II database , which catalogues validated cancer-associated isoforms curated from EST sequencing data. We identified deep-sequencing reads with high-quality alignments and at least one splice site, and compared them with 273 high-confidence cancer-specific splice sites (LOD > = 3) from 198 genes in ASAP II database.
Analysis of non-protein coding transcripts
The breast cancer cDNA library reads were aligned to UCSC Known Genes FastA sequences (human genome release hg18, 260.731 entries) using BLAST, and were classified on the basis of their genomic location. The conservation profile of non-exonic reads was assessed using the UCSC PhastCons17way conservation score. A total of four different datasets were generated: intronic, extragenic, desert conserved and desert non-conserved. These datasets were subsequently cross referenced against CRITICA ncRNA predictions , a subset of RNAdb , and NONCODE . Details of these bioinformatic analyses are available in Additional file 1. In addition, we assessed the overlap between cDNA reads and the ENCODE project annotation of novel transcribed region of unknown function  by intersecting high-quality genome-wide mappings with the genomic coordinates of the encodeRna Table at UCSC.
Biological validation of selected transcripts
Validation was performed by direct sequencing of the cDNA library and RT-PCR. We used RNA obtained from the original lobular breast cancer sample and from other eight tumors and performed RT-PCR using an oligo (dT) primer and SuperScriptTM II Reverse Transcriptase (Invitrogen Inc., Carlsbad, California) according to manufacturer's instructions. For fusion transcripts we sequenced individual PCR products after cloning them into the pCR®II-TOPO TA vector (Invitrogen Inc., Carlsbad, California). Additional file five lists all the PCR primers and their annealing temperatures, together with the results of all validations experiments. Since we were investigating rare transcripts detected from a normalized cDNA library, we reasoned that RNA extracted directly from primary samples could be the best source of genetic material for validation.
Results and discussion
Assessment of the cDNA library normalization before and after deep sequencing
Assessment of the cDNA library normalization by sequence count
454 Reads mapped to the genome (194,806)
Probability of differential expression between the libraries
Prob > 0.999
Prob > 0.999
0.5 < Prob < 0.6
Statistics of the sequencing results
Primary classification of the 454 sequencing reads
Number of reads
Mapping to the genome, 70% coverage, high stringency
Subset with a single match on the genome at 98% identity and 98% coverage (98.98.1 dataset) 1
Subset with a single match on the genome and 100% coverage of the alignment2
Subset of 98.98.1 dataset matching with max 6 errors (mismacthes + indels) and 90% coverage on UCSC all_mrna and RefSeq – canonical transcripts dataset
Subset of 98.98.1 dataset matching inside an UCSC Known Gene (Intragenic dataset, intronic + exonic transcripts)
Matching with max 6 errors (mismatches + indels) and 90% coverage to the Human ORESTES EST dataset (764,587 sequences)
One effective way of exploring molecular diversity by sequencing is through analysis of mRNA 3'UTRs, which are rich in single-feature polymorphisms that distinguish closely related transcripts. The specificity of 3'UTR sequences allows effective annotation of individual mRNAs without assembly of complete cDNAs and can be useful in transcriptome profiling by sequencing . However, caution should be used in data interpretation, as there is some evidence that 3'UTRs may be separately expressed (Wilhelm, Soldà, Mercer, Dinger, Simons, Glazov, Koopman and Mattick, unpublished data). We used the non-redundant dataset of human 3'UTRs (39,758 sequences) from UTRdb, a curated database of 5' and 3' untranslated sequences of eukaryotic mRNAs , as a target for the 98.98.1 sequence reads dataset, requiring perfect identity and coverage of at least 90% to accept a match. From a total of 18,262 matches to the UTRdb we obtained 9,178 reads which could be univocally associated with a single RefSeq transcript (~50% of the matches). We conclude that the 454 reads mapping with high quality on a transcriptome have a high 'resolution power', or ability to distinguish between transcript variants
Genomic classification of sequence reads
Genomic classification of the 454 sequencing reads
Number of reads
Intragenic Unspliced – total
The first clustering divided all the genome matching reads in two large datasets: 'spliced' and 'unspliced' reads (see Methods). The 'unspliced' dataset was split into intragenic or intergenic. Intragenic reads were then assigned to 4 different classes: exon, intron, extended 5' and extended 3'. The 'spliced' dataset was also classified by location within a gene. In order to detect potentially novel transcriptional features we excluded the entire unspliced-exon dataset from further analyses, as this will mostly contain well-known entities. We noticed that there are a significant number of matches in the intragenic non-exonic portion of genes, which we attribute to new exons, retained introns or intronic transcripts.
Identification and primary validation of potential cancer-associated transcriptional events
Detailed bioinformatic analyses were performed on our breast cancer library to identify fusion transcripts, aberrant or novel splicing isoforms, as well as known cancer-related splice variants (see Methods and Additional file 1). In total, we found 477 putative rearrangement events. It must be noted, however, that we expected the rate of false positives to be high, due to sequencing or PCR artifacts. A manually curated selected dataset, including only reads containing at least one end in proximity of a splice site, identified six putative translocation-mediated fusion events and two intragenic deletions; the relative sequences in FastA format are available from Additional File 3 and the genome mapping and annotation are in Additional file 4.
The median length of the sequence reads in this sequencing run was around 85 nt, which is considerably longer than what can be achieved with other deep sequencing technologies, although much shorter than the length achievable with the latest generation of 454 sequencers (FLX and Titanium). This read length made it possible to identify interesting transcripts (ncRNAs; novel isoforms; fusions; deletions) without any sophisticated molecular 'a priori' processing of the RNA, apart from library normalization. This approach may be particularly useful to researchers investigating the transcriptome of species with poor genome annotation.
Experimental validation of selected transcript variants identified with bioinformatic analysis
A subset of the above-described transcripts, including some potential cancer-associated variants, were selected for biological validation by RT-PCR on RNA derived from the same tissue sample that was used for deep sequencing, as well as from other eight lobular breast cancers. Amplification products from the original sample were assessed by Sanger sequencing. We obtained experimental confirmation of one deletion, one novel intergenic transcript, one novel intragenic transcript, nine previously unannotated isoforms – two associated with exon skipping and seven with novel splice site usage – and one known, but rare isoform. A gene fusion event which we detected by deep sequencing could be validated only in the original sample. These results, the primer sequences, the transcript labels and the PCR product sizes are described in Additional file 5.
Detection of known and novel non-coding RNAs and of putative new transcriptional units
Annotation of the non-coding part of the transcriptome
Number of unique ncRNAs matching the breast cancer library4
Long regulatory RNAs 1 :
Cancer associated transcripts
Predicted conserved secondary structure 3
We found that MALAT1 is abundantly expressed in all the publicly available annotated breast cancer samples retrieved from the CleanEx database http://www.cleanex.isb-sib.ch/. Detailed analysis of an Affymetrix ER+ Tamoxifen-treated and untreated breast cancer data set  showed a relevant variation in MALAT1 transcript abundance, including a few outliers with very high expression of this ncRNA. Further analysis of cDNA microarrays, probed with total polyA+ RNA from ER+ lobular and ductal breast cancers treated with Tamoxifen, showed the same expression patterns . This reinforces our finding that high MALAT1 expression may be episodically associated with single breast tumors and that the sensitivity of our deep sequencing approach facilitated the detection of this ncRNA in our sample. We also noticed that the Coefficient of Variation of gene expression values was higher in Tamoxifen-treated versus Tamoxifen-untreated breast cancer samples (84% versus 43% for the Affymetrix array experiments dataset) (Additional file 5).
Surprisingly, we found 23 reads corresponding to PIWI-interacting RNAs (piRNAs), which are thought to be selectively expressed in male and female gonads and are important for the control of transposable elements during germline development . However, piRNA expression in breast cancer is not totally unexpected. A mechanism of piRNA biogenesis that is not confined to the germline has recently been described .
The ENCODE annotation of transcriptionally active regions  (Transcripts of Unknown Function: TUFs) covers only 1% of the genome. However, we found 135 reads that overlap with 60 distinct ENCODE TUFs. We identified individual TUFs with both single and multiple reads, confirming our protocol's efficacy in enriching for non-canonical transcripts (Additional file 8).
Functional annotation of the coding part of the sequenced transcriptome
Functional annotation of transcripts has become an important aspect of microarray studies, and many tools are now available to assess gene expression biases . Using the functional annotation strategies that are usually applied to microarray experiments functional annotation, we examined the genes identified by deep sequencing. 454 reads mapped to 6.067 RefSeq transcripts (Additional file 9) with counts per transcript ranging from 82 to 1, with median of 2. AKAP9, which interacts with multiple signal transduction pathways, had the highest number of counts, followed closely by the ncRNAs MALAT1 and XIST. Among transcripts with very few counts we identified a number of annotated pseudogenes.
Quantitative transcriptional analysis of all the genes expressed by breast tumors has provided the first steps towards defining a molecular signature for the disease, and might ultimately make conventional diagnostic techniques obsolete. The qualitative analysis of the breast cancer transcriptome – such as the one obtained by massive cDNA sequencing and presented here – should instead contribute different and complementary information: the identification of novel possible pathogenic determinants (gene fusions and genome deletions) or biomarkers (aberrant or novel transcripts and isoforms, intronic and extragenic ncRNAs, expressed pseudogenes).
We demonstrated in this work that 454 deep sequencing of a normalized cDNA library, coupled with detailed biology-oriented bioinformatic analyses, has the potential to identify transcripts that may further our understanding of the breast cancer transcriptome, even starting from a relatively small number of sequences. In our primary breast cancer cDNA library and in a number of additional samples with a matching histotype, we have identified and validated several unusual transcriptional events that could be suitable for subsequent functional studies: gene fusions, gene deletions, novel or cancer-associated isoforms and putative novel ncRNAs.
We have also identified from our sequences a very high expression of the cancer-associated MALAT1 ncRNA and we replicated this observation in two different gene expression profiling experiments of well-annotated ER+ breast cancer patient cohorts, finding also an high variance between Tamoxifen treated and untreated patient samples. Although further technical refinements, such as controlled hydrolysis of RNA samples before cDNA synthesis and paired-end or di-tags sequencing, can increase significantly the number and diversity of sequences which can be annotated, our protocol has proved to be very effective in detecting rare or novel transcriptional events. Based on the results presented here, we are confident that further deep sequencing experiments and a similar bioinformatic analysis strategy will yield an even more comprehensive and detailed picture of the breast cancer transcriptome.
This work was supported by the following research grants: CARIPLO grant 2006-0772 'Genomic, epigenetic and transcriptional analysis of tumors by deep sequencing' to IZ and GdB; Italian Fund for Basic Research grant 'Large Laboratories' RBLA03ER38 to GdB; Net2Drug grant n. 037590 to IZ. PP fellowship is supported by the CARIPLO-NOBEL grant to IZ. Bioinformatic analysis and validation strategies are based on the methods developed in the research grant 'Identification of new cancer biomarkers through bioinformatics and application to tumor prognosis and therapy' assigned to AG by Italian Cancer Research Association in 2004.
JSM and LJC are supported by grants from the Australian Research Council ( FF0561986 and S00001543) and the National Health and Medical Research Council (DP456080). RJT is supported by a United States National Science Foundation Graduate Research Fellowship.
We gratefully acknowledge the precious support for HPC of Ivan Merelli, ITB Bioinformatics, and of Elia Biganzoli and Fabio Frascati (Department of Biostatistics, University of Milano) for effective help with statistical analysis.
Compute-intensive bioinformatic tasks were performed on the VITAL-IT cluster at Lausanne, Switzerland http://www.vital-it.ch/ thanks to a Transnational Access Programme grant to AG; and on the bioinformatic cluster 'Michelangelo' of the Laboratory for Interdisciplinary Technologies in Bioinformatics at CILEA, Segrate, Milano, Italy http://www.litbio.org/. Free temporary access in the framework of this research project to the JMP7 statistical discovery software was granted by SAS Italy to AG.
- Carninci P, Yasuda J, Hayashizaki Y: Multifaceted mammalian transcriptome. Curr Opin Cell Biol. 2008, 20 (3): 274-80. 10.1016/j.ceb.2008.03.008.View ArticlePubMedGoogle Scholar
- Furuno M, Pang KC, Ninomiya N, Fukuda S, Frith MC, Bult C, Kai C, Kawai J, Carninci P, Hayashizaki Y, Mattick JS, Suzuki H: Clusters of internally primed transcripts reveal novel long noncoding RNAs. PLoS Genet. 2006, 2 (4): e37-10.1371/journal.pgen.0020037.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu Qian Jia, Du Jiang, Rozowsky Joel, Zhang Zhengdong, Urban Alexander, Ghia Euskirchen, Sherman Weissman, Gerstein Mark, Snyder Michael: Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome. Genome Biol. 2008, 9 (1): R3-10.1186/gb-2008-9-1-r3.PubMed CentralView ArticlePubMedGoogle Scholar
- Mattick JS, Makunin IV: Non-coding RNA. Hum Mol Genet. 2006, 15 (Spec No 1): R17-29. 10.1093/hmg/ddl046.View ArticlePubMedGoogle Scholar
- Prasanth KV, Spector DL: Eukaryotic regulatory RNAs: an answer to the 'genome complexity' conundrum. Genes Dev. 2007, 21 (1): 11-42. 10.1101/gad.1484207.View ArticlePubMedGoogle Scholar
- Lestrade L, Weber MJ: snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 2006, D158-62. 10.1093/nar/gkj002. 34 Database
- Stefani G, Slack FJ: Small non-coding RNAs in animal development. Nat Rev Mol Cell Biol. 2008, 9 (3): 219-30. 10.1038/nrm2347.View ArticlePubMedGoogle Scholar
- Ravasi T, Suzuki H, Pang KC, Katayama S, Furuno M, Okunishi R, Fukuda S, Ru K, Frith MC, Gongora MM, Grimmond SM, Hume DA, Hayashizaki Y, Mattick JS: Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 2006, 16 (1): 11-9. 10.1101/gr.4200206.PubMed CentralView ArticlePubMedGoogle Scholar
- Mattick JS: A new paradigm for developmental biology. J Exp Biol. 2007, 210: 1526-1547. 10.1242/jeb.005017.View ArticlePubMedGoogle Scholar
- Mehler MF, Mattick JS: Noncoding RNAs and RNA editing in brain development, functional diversification, and neurological disease. Physiol Rev. 2007, 87: 799-823. 10.1152/physrev.00036.2006.View ArticlePubMedGoogle Scholar
- Amaral PP, Mattick JS: Noncoding RNA in development. Mammalian Genome. 2008, 19 (7–8): 454-92. 10.1007/s00335-008-9136-7.View ArticlePubMedGoogle Scholar
- Mercer TR, Dinger ME, Sunkin SM, Mehler MF, Mattick JS: Specific expression of long noncoding RNAs in the mouse brain. Proc Natl Acad Sci USA. 2008, 105 (2): 716-21. 10.1073/pnas.0706729105.PubMed CentralView ArticlePubMedGoogle Scholar
- Dinger ME, Amaral PP, Mercer TR, Pang KC, Bruce SJ, Gardiner BB, Askarian-Amiri ME, Ru K, Soldà G, Simons C, Sunkin SM, Crowe ML, Grimmond SM, Perkins AC, Mattick JS: Long noncoding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res. 2008, 18 (9): 1433-45. 10.1101/gr.078378.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PA, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008, 40 (6): 722-9. 10.1038/ng.128.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen W, Kalscheuer V, Tzschach A, Menzel C, Ullmann R, Schulz MH, Erdogan F, Li N, Kijas Z, Arkesteijn G, Pajares IL, Goetz-Sothmann M, Heinrich U, Rost I, Dufke A, Grasshoff U, Glaeser B, Vingron M, Ropers HH: Mapping translocation breakpoints by next-generation sequencing. Genome Res. 2008, 18 (7): 1143-9. 10.1101/gr.076166.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM: Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009, 458 (7234): 97-101. 10.1038/nature07638.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao Q, Caballero OL, Levy S, Stevenson BJ, Iseli C, de Souza SJ, Galante PA, Busam D, Leversha MA, Chadalavada K, Rogers YH, Venter JC, Simpson AJ, Strausberg RL: Transcriptome-guided characterization of genomic rearrangements in a breast cancer cell line. Proc Natl Acad Sci USA. 2009, 106 (6): 1886-91. 10.1073/pnas.0812945106. Epub 2009 Jan 30PubMed CentralView ArticlePubMedGoogle Scholar
- Kim N, Kim P, Nam S, Shin S, Lee S: ChimerDB – a knowledgebase for fusion sequences. Nucleic Acids Res. 2006, D21-4. 10.1093/nar/gkj019. 34 Database
- Kim N, Alekseyenko AV, Roy M, Lee C: The ASAP II database: analysis and comparative genomics of alternative splicing in 15 animal species. Nucleic Acids Res. 2007, D93-8. 10.1093/nar/gkl884. 35 Database
- Mitelman F, Johansson B, Mertens F: The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer. 2007, 7 (4): 233-45. 10.1038/nrc2091.View ArticlePubMedGoogle Scholar
- Ritchie W, Granjeaud S, Puthier D, Gautheret D: Entropy measures quantify global splicing disorders in cancer. PLoS Comput Biol. 2008, 4 (3):
- Afify A, Pang L, Howell L: Diagnostic utility of CD44 standard, CD44v6, and CD44v3-10 expression in adenocarcinomas presenting in serous fluids. Appl Immunohistochem Mol Morphol. 2007, 15 (4): 446-50. 10.1097/01.pai.0000213154.49063.22.View ArticlePubMedGoogle Scholar
- Panzitt K, Tschernatsch MM, Guelly C, Moustafa T, Stradner M, Strohmaier HM, Buck CR, Denk H, Schroeder R, Trauner M, Zatloukal K: Characterization of HULC, a novel gene with striking up-regulation in hepatocellular carcinoma, as noncoding RNA. Gastroenterology. 2007, 132 (1): 330-42. 10.1053/j.gastro.2006.08.026.View ArticlePubMedGoogle Scholar
- Yamada K, Kano J, Tsunoda H, Yoshikawa H, Okubo C, Ishiyama T, Noguchi M: Phenotypic characterization of endometrial stromal sarcoma of the uterus. Cancer Sci. 2006, 97 (2): 106-12. 10.1111/j.1349-7006.2006.00147.x.View ArticlePubMedGoogle Scholar
- Luo JH, Ren B, Keryanov S, Tseng GC, Rao UN, Monga SP, Strom S, Demetris AJ, Nalesnik M, Yu YP, Ranganathan S, Michalopoulos GK: Transcriptomic and genomic analysis of human hepatocellular carcinomas and hepatoblastomas. Hepatology. 2006, 44 (4): 1012-24. 10.1002/hep.21328.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin R, Maeda S, Liu C, Karin M, Edgington TS: A large noncoding RNA is a marker for murine hepatocellular carcinomas and a spectrum of human carcinomas. Oncogene. 2007, 26 (6): 851-8. 10.1038/sj.onc.1209846.View ArticlePubMedGoogle Scholar
- Calin GA, Liu CG, Ferracin M, Hyslop T, Spizzo R, Sevignani C, Fabbri M, Cimmino A, Lee EJ, Wojcik SE, Shimizu M, Tili E, Rossi S, Taccioli C, Pichiorri F, Liu X, Zupo S, Herlea V, Gramantieri L, Lanza G, Alder H, Rassenti L, Volinia S, Schmittgen TD, Kipps TJ, Negrini M, Croce CM: Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell. 2007, 12 (3): 215-29. 10.1016/j.ccr.2007.07.027.View ArticlePubMedGoogle Scholar
- Margulies M, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-80.PubMed CentralPubMedGoogle Scholar
- Zhulidov PA, Bogdanova EA, Shcheglov AS, Vagner LL, Khaspekov GL, Kozhemyako VB, Matz MV, Meleshkevitch E, Moroz LL, Lukyanov SA, Shagin DA: Simple cDNA normalization using kamchatka crab duplex-specific nuclease. Nucleic Acids Research. 2004, 32 (3): e37-10.1093/nar/gnh031.PubMed CentralView ArticlePubMedGoogle Scholar
- Grillo G, Attimonelli M, Liuni S, Pesole G: CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Computer Appl Biosci. 1996, 12: 1-8.Google Scholar
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-77. 10.1101/gr.9.9.868.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ: BLAT – the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-64.PubMed CentralView ArticlePubMedGoogle Scholar
- Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson H, Bejerano G, Barber GP, Baertsch R, Haussler D, Kent WJ: The UCSC genome browser database: update 2007. Nucleic Acids Res. 2007, D668-73. 10.1093/nar/gkl928. 35 Database
- Boccia A, Petrillo M, di Bernardo D, Guffanti A, Mignone F, Confalonieri S, Luzi L, Pesole G, Paolella G, Ballabio A, Banfi S: DG-CST (Disease Gene Conserved Sequence Tags), a database of human-mouse conserved elements associated to disease genes. Nucleic Acids Res. 2005, D505-10. 33 Database
- Mignone F, Anselmo A, Donvito G, Maggi GP, Grillo G, Pesole G: Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes. BMC Genomics. 2008, 9 (1): 277-10.1186/1471-2164-9-277.PubMed CentralView ArticlePubMedGoogle Scholar
- Pang KC, Stephen S, Dinger ME, Engstrom PG, Lenhard B, Mattick JS: RNAdb 2.0 – an expanded database of mammalian non-coding RNAs. Nucleic Acids Res. 2007, D178-82. 10.1093/nar/gkl926. 35 Database
- He S, Liu C, Skogerbø G, Zhao H, Wang J, Liu T, Bai B, Zhao Y, Chen R: NONCODE v2.0: decoding the non-coding. Nucleic Acids Res. 2008, D170-2. 36 Database
- ENCODE Project Consortium: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.View ArticleGoogle Scholar
- Zhang X, Ding L, Sandford AJ: Selection of reference genes for gene expression studies in human neutrophils by real-time PCR. BMC Mol Biol. 2005, 6 (1): 4-10.1186/1471-2199-6-4.PubMed CentralView ArticlePubMedGoogle Scholar
- Stéphane Audic, and Jean-Michel: The Significance of Digital Gene Expression Profiles. Genome Research. 1997, 7 (10): 986-995.Google Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-8. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Eveland AL, McCarty DR, Koch KE: Transcript profiling by 3'-untranslated region sequencing resolves expression of gene families. Plant Physiol. 2008, 146 (1): 32-44. 10.1104/pp.107.108597.PubMed CentralView ArticlePubMedGoogle Scholar
- Mignone F, Grillo G, Licciulli F, Iacono M, Liuni S, Kersey PJ, Duarte J, Saccone C, Pesole G: UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2005, D141-6. 33 Database
- Di Segni G, Gastaldi S, Tocchini-Valentini GP: Cis- and trans-splicing of mRNAs mediated by tRNA sequences in eukaryotic cells. Proc Natl Acad Sci USA. 2008, 105 (19): 6864-9. 10.1073/pnas.0800420105.PubMed CentralView ArticlePubMedGoogle Scholar
- Huh KW, DeMasi J, Ogawa H, Nakatani Y, Howley PM, Münger K: Association of the human papillomavirus type 16 E7 oncoprotein with the 600-kDa retinoblastoma protein-associated factor, p600. Proc Natl Acad Sci USA. 2005, 102 (32): 11492-7. 10.1073/pnas.0505337102.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruan Y, Ooi HS, Choo SW, Chiu KP, Zhao XD, Srinivasan KG, Yao F, Choo CY, Liu J, Ariyaratne P, Bin WG, Kuznetsov VA, Shahab A, Sung WK, Bourque G, Palanisamy N, Wei CL: Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res. 2007, 17 (6): 828-38. 10.1101/gr.6018607.PubMed CentralView ArticlePubMedGoogle Scholar
- Venables JP, Klinck R, Bramard A, Inkel L, Dufresne-Martin G, Koh C, Gervais-Bird J, Lapointe E, Froehlich U, Durand M, Gendron D, Brosseau JP, Thibault P, Lucier JF, Tremblay K, Prinos P, Wellinger RJ, Chabot B, Rancourt C, Elela SA: Identification of alternative splicing markers for breast cancer. Cancer Res. 2008, 68 (22): 9525-31. 10.1158/0008-5472.CAN-08-1769.View ArticlePubMedGoogle Scholar
- Boudreau N, Myers C: Breast cancer-induced angiogenesis: multiple mechanisms and the role of the microenvironment. Breast Cancer Res. 2003, 5 (3): 140-6. 10.1186/bcr589.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu Q, Modrek B, Lee C: Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic Acids Res. 2002, 30 (17): 3754-66. 10.1093/nar/gkf492.PubMed CentralView ArticlePubMedGoogle Scholar
- Frith MC, Bailey TL, Kasukawa T, Mignone F, Kummerfeld SK, Madera M, Sunkara S, Furuno M, Bult CJ, Quackenbush J, Kai C, Kawai J, Carninci P, Hayashizaki Y, Pesole G, Mattick JS: Discrimination of non-protein-coding transcripts from protein-coding mRNA. RNA Biol. 2006, 3 (1): 40-8.View ArticlePubMedGoogle Scholar
- Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8 (7): R143-10.1186/gb-2007-8-7-r143.PubMed CentralView ArticlePubMedGoogle Scholar
- Leygue E: Steroid receptor RNA activator (SRA1): unusual bifaceted gene products with suspected relevance to breast cancer. Nucl Recept Signal. 2007, 5: e006-PubMed CentralPubMedGoogle Scholar
- Wilusz JE, Freier SM, Spector DL: 3'End Processing of a Long Nuclear-Retained Noncoding RNA Yields a tRNA-like Cytoplasmic RNA. Cell. 2008, 135 (5): 919-932. 10.1016/j.cell.2008.10.012.PubMed CentralView ArticlePubMedGoogle Scholar
- Praz V, Jagannathan V, Bucher P: CleanEx: a database of heterogeneous gene expression data based on a consistent gene nomenclature. Nucleic Acids Res. 2004, 32: D542-7. 10.1093/nar/gkh107.PubMed CentralView ArticlePubMedGoogle Scholar
- Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA, Klijn JG, Larsimont D, Buyse M, Bontempi G, Delorenzi M, Piccart MJ, Sotiriou C: Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J Clin Oncol. 2007, 25 (10): 1239-46. 10.1200/JCO.2006.07.1522.View ArticlePubMedGoogle Scholar
- Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, Daidone MG, Pierotti MA, Berns EM, Jansen MP, Foekens JA, Delorenzi M, Bontempi G, Piccart MJ, Sotiriou C: Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics. 2008, 9: 239-10.1186/1471-2164-9-239.PubMed CentralView ArticlePubMedGoogle Scholar
- Betel D, Sheridan R, Marks DS, Sander C: Computational analysis of mouse piRNA sequence and biogenesis. PLoS Comput Biol. 2007, 3 (11): e222-10.1371/journal.pcbi.0030222.PubMed CentralView ArticlePubMedGoogle Scholar
- Badger JH, Olsen GJ: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999, 16 (4): 512-24.View ArticlePubMedGoogle Scholar
- Guffanti A, Reid JF, Alcalay M, Simon G: The meaning of it all: web-based resources for large-scale functional annotation and visualization of DNA microarray data. Trends Genet. 2002, 18 (11): 589-92. 10.1016/S0168-9525(02)02795-6.View ArticlePubMedGoogle Scholar
- Sherman BT, Huang da W, Tan Q, Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics. 2007, 8: 426-10.1186/1471-2105-8-426.PubMed CentralView ArticlePubMedGoogle Scholar
- Muller G, Gaspin C, Etienne A, Westhof E: Automatic display of RNA secondary structures. Comput Appl Biosci. 1993, 9 (5): 551-61.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.