The Retinome – Defining a reference transcriptome of the adult mammalian retina/retinal pigment epithelium

Background The mammalian retina is a valuable model system to study neuronal biology in health and disease. To obtain insight into intrinsic processes of the retina, great efforts are directed towards the identification and characterization of transcripts with functional relevance to this tissue. Results With the goal of assembling a first genome-wide reference transcriptome of the adult mammalian retina, referred to as the retinome, we have extracted 13,037 non-redundant annotated genes from nearly 500,000 published datasets on redundant retina/retinal pigment epithelium (RPE) transcripts. The data were generated from 27 independent studies employing a wide range of molecular and biocomputational approaches. Comparison to known retina-/RPE-specific pathways and established retinal gene networks suggests that the reference retinome may represent up to 90% of the retinal transcripts. We show that the distribution of retinal genes along the chromosomes is not random but exhibits a higher order organization closely following the previously observed clustering of genes with increased expression. Conclusion The genome-wide retinome map offers a rational basis for selecting candidate genes for hereditary as well as complex retinal diseases, facilitating detailed studies into normal and pathological pathways. To make this unique resource freely available we have built a database providing a query interface to the reference retinome [1].


Background
The mammalian retina is a highly structured tissue developmentally originating from neuroectodermal evagination of the diencephalon and subsequent invagination processes resulting in the formation of two cellular layers which ultimately give rise to the inner neural retina and the outer retinal pigment epithelium (RPE) monolayer [2]. In the adult, the neural retina consists of approximately 55 distinct cell types histologically structured into three layers of cells (photoreceptors, intermediate neurons and ganglion cells) and two layers of neuronal interconnections (outer and inner plexiform layers) [3]. The RPE is differentiated into polarized cells with an apical and a basal orientation separating the neural retina from the underlying choroidal blood supply. With its apical microvilli-like processes, the RPE establishes an intimate contact with the photoreceptor outer segments to sustain their metabolic support and maintain photoreceptor integrity [4]. Together, the neural retina and the RPE provide the structural and functional basis for light perception by ensuring the capture of photons, the conversion of light stimuli into complex patterns of neuronal impulses and the transmission of the initially processed signals to the higher visual centers of the brain.
Recent progress in retinal research has greatly enhanced our current understanding of basic functional processes in the adult retina (e.g. [4,5]). A great deal of effort has focused on the molecular dissection of the phototransduction pathway and the retinoid cycle (e.g. ref. [6]). Besides elucidating physiological mechanisms in normal tissue, the identification of genes involved in hereditary retinal disease has provided another valuable source of insight into functional pathways of the retina and the RPE (reviewed in [7,8]).
Despite these advances, a remaining challenge is to obtain a reference genome-wide expression map of the retina/RPE transcriptome, further facilitating the identification of retinal susceptibility genes, but most importantly, offering an invaluable resource for functional genomics studies. Initial analyses of human [9,10] and mouse [11] whole genome sequences and the use of more recent comparative gene prediction algorithms [12,13] suggest an overall number of mammalian gene loci in the range of 35,000 to 45,000. These estimates have largely been validated by experimental data on gene transcription [14,15] although alternative promoter usage, differential exon splicing during mRNA maturation, alternative usage of polyadenylation sites and other post-transcriptional modifications may further increase the genetic diversity required to encode the full complement of cellular transcripts [16,17]. In addition, there may be a considerable number of non-coding genes unaccounted for by current annotations [18].
In recent years, a number of approaches and technologies were adopted to identify genes expressed in the retina/RPE of human, cow, dog and mouse including data-mining and assembly of publicly available expressed sequence tag (EST) information [19][20][21][22][23], sequencing of cDNA libraries generated via conventional methods [24][25][26][27][28][29] or via normalization techniques [30,31], hybridization to gene arrays of various formats [32,33] and serial analysis of gene expression (SAGE) [34,35]. Suppression subtractive hybridization (SSH) has been shown to be an efficient technique with which differentially expressed genes can be normalized and enriched over 1000-fold in a single round of hybridization [36]. Subsequently, applications of SSH to identify retina and RPE-enriched genes have been reported [37][38][39].
Based on a comprehensive survey of data available from 27 independent studies applying a wide spectrum of gene identification approaches we have now assembled a first genome-wide reference transcriptome of the adult mammalian retina/RPE. This reference transcriptome comprises 13,037 non-redundant transcripts and likely reflects up to 90% of the mammalian retinome.

Results
A total of 481,137 primary datasets on gene transcripts from the adult mammalian retina/RPE tissues have been generated in 27 independent studies (Table 1). Of these, 52,630 datasets (31,814 from retina, 11,632 from RPE and 9,184 from retina/RPE) were available and attributable to unique LocusLink identifiers (IDs). Correcting for gene redundancy within and between studies yielded a catalogue of 15,645 retinal/RPE genes. A survey of incidence and origin of each of these genes in the various studies analyzed demonstrated that 2,608 transcripts were found only once (see additional File 1) while the remaining 13,037 genes (see additional File 2) were confirmed in at least two and up to 16 independent gene identification approaches (Table 2). Thus, the latter compilation of genes may represent a more conservative description of the retinome minimizing a potential bias in data ascertainment. Of the 13K retinome, 1,411 genes were solely identified in retinal studies (see additional File 3) while 246 genes were exclusively found in the RPE datasets (see additional File 4).
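The confirmation rule described above (keep a gene only if at least two independent studies report it) can be illustrated with a minimal Python sketch. The function name, the input layout (study label mapped to a list of gene identifiers) and the toy data are assumptions made for illustration; they are not the authors' actual pipeline.

```python
from collections import Counter


def confirmed_genes(studies, min_studies=2):
    """Return gene IDs reported by at least `min_studies` distinct studies."""
    counts = Counter()
    for gene_ids in studies.values():
        # set() removes redundancy within a study, so each study
        # contributes at most one count per gene
        counts.update(set(gene_ids))
    return {gene for gene, n in counts.items() if n >= min_studies}


# Hypothetical toy input: study label -> gene identifiers (with redundancy)
toy_studies = {
    "study_A": [1, 2, 2, 3],
    "study_B": [2, 3, 4],
    "study_C": [3, 5],
}

reference = confirmed_genes(toy_studies)                     # seen in >= 2 studies
singletons = set().union(*toy_studies.values()) - reference  # seen only once
```

In the terms of the text, `reference` corresponds to the 13K retinome and `singletons` to the 2,608 transcripts found in a single study; together they make up the full 15,645-gene catalogue.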
To assess the degree of completeness of the adult mammalian retinome, we compared the LocusLink IDs of the 13,037 transcripts to partial lists of genes known i) to be specifically expressed in the retina/RPE (category I, n = 43) (see additional File 5), ii) to play a role in the phototransduction pathway/vitamin A cycle (category II, n = 57) (see additional File 6), iii) to encode retinal/RPE proteins verified by immunohistochemistry (category III, n = 260) (see additional File 7), and iv) to be associated with syndromic and non-syndromic retinal disease (category IV, n = 102) (see additional File 8). The data show that the compiled retinome covers all retina/RPE-specific transcripts (43/43) and 53/57 (93%) of the phototransduction pathway/vitamin A cycle genes. Known retinal/RPE proteins are represented by 204/260 (79%) transcripts while 87/102 (85%) genes known to be involved in retinal diseases are found in the 13K retinome collection (Table 3). To further evaluate the significance of these findings, partial transcriptomes of heart (n = 3,660; see additional File 9), liver (n = 5,780; see additional File 10) and prostate (n = 7,018; see additional File 11) were assembled and compared to the four selected categories.
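The coverage figures quoted above are simple set overlaps between a reference gene list and the retinome. A minimal sketch of that calculation follows; the function name and the placeholder gene identifiers are hypothetical, and real input would be the LocusLink IDs of the respective lists.

```python
def coverage(reference_list, transcriptome):
    """Return (hits, total, fraction) of a reference list found in a transcriptome."""
    reference_set = set(reference_list)
    hits = len(reference_set & set(transcriptome))
    total = len(reference_set)
    fraction = hits / total if total else 0.0
    return hits, total, fraction


# Hypothetical placeholder identifiers, not real gene symbols
toy_retinome = {"G1", "G2", "G3", "G4"}
toy_category = ["G1", "G2", "G9"]

hits, total, fraction = coverage(toy_category, toy_retinome)
```

Applying the same function to the partial heart, liver and prostate transcriptomes would give the comparison values for each of the four categories.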
A comparison of the 13K retinome with partial transcriptomes of heart, liver, and prostate suggests a high degree of overlapping expression between retina/RPE and heart (3,496/3,660), liver (5,343/5,780) and prostate (6,471/7,018). A total of 2,330 genes are expressed in all tissues and represent putative "housekeeping" genes (see additional File 12). It should be noted that the low number of ubiquitously expressed genes is largely due to the fragmentary nature of the heart, liver, and prostate transcriptomes. With increasing transcriptome complexities this number is likely to increase. Analysis of the least complete transcriptome, the heart, reveals that 2,330/3,660 (64%) transcripts can be classified as ubiquitously expressed (see additional Files 9 and 12) while a maximum of 1,330/3,660 (36%) genes may display tissue-restricted or tissue-specific expression. A comparison of more complete transcriptomes may significantly reduce the latter estimate. So far 5,051 genes are only found in the retinome representing a collection of "retinome-enriched" transcripts, while 7,986 are also present in at least one of the partial transcriptomes of the heart, liver or prostate. Thirty-two genes were found to be expressed in heart, liver and prostate but not in the retinome (see additional File 13). [...] (Fig. 1a,1b).
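The three gene groups discussed in this comparison (genes shared by all tissues, "retinome-enriched" genes, and genes present in all other tissues but missing from the retinome) are plain set operations. The following sketch uses hypothetical names and toy data and only illustrates the logic.

```python
def partition_retinome(retinome, other_tissues):
    """Split genes into (shared by all tissues, retinome-only, absent from retinome).

    `other_tissues` maps a tissue name to its gene collection and must
    contain at least one tissue.
    """
    retinome = set(retinome)
    others = [set(genes) for genes in other_tissues.values()]
    shared_by_all = retinome.intersection(*others)   # putative "housekeeping"
    enriched = retinome.difference(*others)          # "retinome-enriched"
    absent = set.intersection(*others) - retinome    # in all others, not retinome
    return shared_by_all, enriched, absent


# Hypothetical toy data
toy_retinome = {"a", "b", "c", "d"}
toy_others = {
    "heart": {"a", "b", "x"},
    "liver": {"a", "c", "x"},
    "prostate": {"a", "x", "y"},
}
shared, enriched, absent = partition_retinome(toy_retinome, toy_others)

# Consistency check on the figures reported in the text:
# retinome-enriched genes plus genes shared with at least one other tissue
retinome_total = 5_051 + 7_986
```

The final line reproduces the bookkeeping in the text: the 5,051 retinome-enriched genes and the 7,986 genes shared with at least one other tissue add up to the 13,037 genes of the reference retinome.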
To provide positional candidates for syndromic and non-syndromic hereditary retinopathies, the 13K reference retinome as well as the "retinome-enriched" transcripts (5,051 transcripts) were superimposed onto the disease intervals of 42 thus far uncloned retinal disorders (Table 4). In many instances, this results in a significant reduction of genes in the respective intervals, offering a manageable number of candidates for retinal diseases (e.g. the RP29 locus contains 28 syntenic gene predictions (SGPs), of which 5 are present in the reference retinome, namely GPM6A, WDR17, FLJ22649, VEGFC and AGA). The number of possible candidates is further reduced in the "retinome-enriched" transcript category to GPM6A and VEGFC. To make the information on the reference retinome available, we have created the interactive RetinaCentral database, a research portal which collects and stores information on genes and proteins functionally relevant to the retinal tissues [1]. We have implemented an interactive data retrieval system that presently contains linked information on the 13,037 genes of the 13K reference retinome. Database scripts were programmed to synchronize the data with LocusLink index files [41] which are updated daily [42].
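The candidate reduction described above amounts to selecting genes that map inside a disease interval and are also part of the retinome. A minimal sketch, assuming each gene has a chromosome and a single start coordinate; the gene labels and coordinates below are hypothetical placeholders, not real map positions.

```python
def candidates_in_interval(gene_positions, chrom, start, end, retinome):
    """Return sorted retinome genes whose start lies within [start, end] on `chrom`.

    `gene_positions` maps gene -> (chromosome, start position in bp).
    """
    return sorted(
        gene
        for gene, (g_chrom, g_pos) in gene_positions.items()
        if g_chrom == chrom and start <= g_pos <= end and gene in retinome
    )


# Hypothetical placeholder genes and coordinates
toy_positions = {
    "GENE_A": ("chr4", 1_500_000),
    "GENE_B": ("chr4", 2_500_000),
    "GENE_C": ("chr4", 9_000_000),
    "GENE_D": ("chr5", 2_000_000),
}
toy_retinome = {"GENE_A", "GENE_C", "GENE_D"}

candidates = candidates_in_interval(
    toy_positions, "chr4", 1_000_000, 3_000_000, toy_retinome
)
```

Passing the "retinome-enriched" subset instead of the full retinome as the last argument gives the second, narrower candidate list.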

Discussion
Compiling the transcriptome of a cell or tissue is arguably more demanding than establishing the number of gene loci encoded by a given genome sequence [43]. This may mainly be explained by the dynamic nature of mRNA itself which frequently produces alternative transcripts from a single gene locus by usage of tissue-specific promoters, cryptic splice sites or variable polyadenylation signals [44,45]. In addition, variation in gene expression is known to occur within and between populations [46,47] and allele-specific expression, even from non-imprinted genes, appears to be common [48]. Further complicating transcriptome definition are effects of gender and age on RNA expression [49] as well as agonal and postmortem factors which greatly affect RNA integrity and thus frequently influence subsequent analyses [50]. Finally, differences in experimental technologies and data postprocessing add a further level of variability. Taken together, the complexities in mRNA metabolism and experimental data handling strongly suggest that there is not a single transcriptome for a given cell or tissue but rather an arbitrary number of individual transcriptomes which need to be defined by a series of parameters such as age, gender, ethnicity, and cause and time of death of the tissue donor, besides many others. It is therefore advisable to initially aim for a reference transcriptome providing a blueprint of an expression profile within a broadly defined time-frame. Following this line of reasoning, we here present a framework of a first reference transcriptome of the retina/RPE consisting of 13,037 unique transcripts which broadly characterize the mature state of expression in this tissue. The present meta-analysis has integrated information from 27 studies employing diverse technologies to identify retinal/RPE transcripts. Among these, SAGE represents a sensitive tool to detect low level transcription [51] while the PCR-based SSH method is well suited to enrich for differentially expressed genes [36]. The combined use of these approaches together with conventional cDNA library sequencing and microarray-based techniques provides a more solid assessment of gene expression than each method would alone. For example, SAGE is based on sequencing of hundreds of thousands of short (10, 14, or 21 bp) tags, ideally derived from a unique location of a single transcript. Rare tags could originate from infrequently expressed transcripts but could also reflect minor genomic contamination or minor sequencing errors. For the assembly of the reference retinome we have addressed these concerns by including only those transcripts that have independently been confirmed in a second unrelated study. This has led to a conservative assembly of the 13K retinome. It should be kept in mind, however, that this procedure likely excludes a number of authentic transcripts. This is illustrated by the finding that the 15K retinome, which comprises 15,645 transcripts including those which were solely found in a single study (Table 2), contains an additional five of the 102 known retinal disease genes (RHOK, MTATP6, CHM, LRAT, RIMS1) not included in the 13K retinome. Similarly, an additional three genes (RHOK, LRAT, GPRK7) involved in the vitamin A/phototransduction pathway are part of the 15K but not the 13K retinome. With additional transcription data on the retina/RPE becoming available, a second generation retinome map will need to address this issue.
[Figure: Chromosomal distribution of transcripts defining the reference retinome]
The estimation of transcriptome size represents one of the fundamental questions in molecular biology. Early studies using reassociation kinetics have calculated the number of distinct mRNA transcripts present in various mouse tissues to be between 11,500 and 12,500 [52]. Initial SAGE analyses have led to the conclusion that the number of different transcripts observed in normal and tumorous tissue may lie between 14,247 and 20,471 [53]. Recent data from comprehensive EST sequencing of a number of tissues including brain, breast, colon, head/neck, kidney, lung, ovary, prostate, and uterus suggest expression of between 7,500 and 13,500 distinct genes for each tissue [54]. Although the size of the reference retinome is consistent with these estimates, the question of adequate transcript representation by the current compilation remains open. We have addressed this by defining a number of gene groups with known expression in retina/RPE and comparing these to the reference retinome. Genes exclusively expressed in retina/RPE are highly represented in the retinome (100%), as are mainly tissue-specific genes known to play a role in the vitamin A/phototransduction pathway (93%) (Table 3). A partial list of 260 genes whose encoded proteins were shown by immunohistochemistry to be expressed in the retina/RPE (but may also be present in other tissues) was represented in the reference retinome at a rate of approximately 79%. Similar numbers were obtained for the retinome coverage of retinal disease genes (85%). From these data we conclude that the 13K reference retinome is highly representative of retina/RPE-expressed genes and may describe as much as 90% of the transcript complement in the adult state.
Another point of interest concerns the proportion of retinome transcripts which is uniquely expressed in this tissue. Brentani et al. [54] estimate that any two tissues may share between 73% and 84% of their transcriptomes. Comparing transcription in three tissues (breast, colon, head/neck) the authors found overlapping expression in 47% of transcripts. To investigate this in more detail, we have compiled three partial transcriptomes from heart (n = 3,660), liver (n = 5,780) and prostate (n = 7,018) by applying the same stringent criteria as defined for the retinome. Limited by the size of the partial heart transcriptome, we determined 2,330 transcripts (termed "housekeeping" genes) to be expressed in all four tissues (i.e. 64% of the heart transcriptome). Comparing the retinome to any of the partial transcriptomes revealed overlapping gene profiles between 92% and 95%. This would suggest that only a minor proportion of retinome transcripts is indeed unique to the retina/RPE. Thus far, we have identified a group of so-called "retinome-enriched" genes comprising 5,051 transcripts which are not present in the partial transcriptomes of heart, liver and prostate. This group most likely contains additional "housekeeping" or tissue-restricted transcripts and needs further adjustment by more refined in-silico normalization to comprehensive reference transcriptomes of other tissues.
Highly expressed genes, including those with a ubiquitous or a tissue-specific transcription profile, have been shown to cluster in chromosomal regions of increased gene expression (termed RIDGEs) [55,56]. Functionally, this higher order structure has been related to transcriptional regulation [56,57]. To search for a possible correlation, we have determined the chromosomal distribution of the reference retinome independent of gene density. Our data show good agreement with the previously established regional expression map defining approximately 30 RIDGEs within the human genome. Overlaps are most evident for chromosomes 6, 9, 11, 17, and 19. From this we conclude that the majority of transcripts assembled in the reference retinome share characteristics of the RIDGEs including moderate to high level expression. This finding may be ascribed to the stringent selection criteria we have applied to assemble the reference retinome by excluding all transcripts (n = 2,608) that were reported in only a single study. Conversely, the RIDGE-like pattern of the reference retinome could be an indication that missing transcripts may have features compatible with chromosomal domains defined as anti-RIDGEs [56]. As opposed to RIDGEs, clustering of genes in anti-RIDGEs seems associated with significantly decreased expression [56]. In contrast to their fractional occurrence in transcriptomes, the identification of such low abundance transcripts is likely to require significant resources in order to compile more complete transcriptomes.
To provide positional candidates for retinal disease genes, we have mapped the transcripts representing the reference retinome to the minimal regions defined for 42 retinal disease loci with as yet undefined gene mutations. To further limit the number of candidate genes, in particular for loosely defined disease loci such as RP28 or VRNI, we have similarly integrated the "retinome-enriched" transcripts. This also accounts for the fact that approximately 50% of retinal disease genes are retina/RPE-specific [58].

Conclusions
We here present a first near-complete transcriptome of a defined tissue, the retinome, which may serve as a reference for further efforts to establish spatial, i.e. cell-specific, and developmental transcriptomes of the retina/RPE. A fundamental aspect of the current study was to integrate the available information on gene identification generated by a wide range of techniques. This ensures robustness and reliability of transcript data providing a stringent framework for further expression studies in systems biology. A similar approach for other tissues/cells would be advisable as this may greatly facilitate in-silico identification of tissue-specific genes to elucidate functional pathways vital for a defined cell population. In addition, the reference retinome may prove valuable for providing strong candidates for hereditary as well as genetically complex diseases and thus may help to further our understanding of retinal biology in health and disease.
To assemble partial transcriptomes of heart, liver and prostate, for each tissue data were mined from at least one SAGE library, in addition to expressed sequence tag (EST) sources (see additional File 16). Similar to the criteria for the assembly of the retinome, genes identified in only one study were disregarded. EST retrieval was facilitated by use of the Gene Library Summarizer [63] which retrieves the known genes represented by at least one EST and generated from a tissue sample with normal histology.

Data retrieval and analysis
Partial lists of genes known to play a role in the retina and/or the RPE were assembled from the literature (see additional Files 5, 6, 7 and 8). Additional File 5 summarizes genes known to be exclusively expressed in retina and/or RPE, while additional File 6 includes genes involved in the phototransduction cascade and the vitamin A cycle. Additional File 7 is a partial compilation of genes/proteins verified by immunohistochemistry to be present in adult mammalian retina and/or RPE. A list of 102 genes involved in retinal diseases was retrieved from the RetNet database, January 2004 [58] (see additional File 8).

Assignment of genes and disease loci to the human genome
A total of 43,109 human non-redundant syntenic gene predictions (SGP) were retrieved (as of December 2003) and chromosomally mapped to the reference sequence of the human genome (July 2003) utilizing the UCSC Genome Table Browser [64]. Based on the position of their putative transcription start sites, the SGPs were assigned to 5 Mb bins along the human chromosomes. In addition, one-megabase bins were defined for refined analysis of chromosomes 6 and 19 (Fig. 1b). Similarly, the chromosomal map positions of the retinome transcripts were determined by querying the UCSC Genome Table Browser with the respective LocusLink, UniGene or RefSeq IDs.
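The bin assignment described above can be sketched as an integer division of each transcription start position by the bin size. The function name, the 0-based coordinate convention and the toy positions are assumptions for illustration only.

```python
from collections import Counter

BIN_SIZE = 5_000_000  # 5 Mb bins, as described in the text


def bin_counts(tss_positions, bin_size=BIN_SIZE):
    """Count genes per (chromosome, bin index).

    `tss_positions` is an iterable of (chromosome, start position in bp);
    positions are assumed to be 0-based, so bin 0 covers 0 .. bin_size - 1.
    """
    counts = Counter()
    for chrom, pos in tss_positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts


# Hypothetical toy positions
toy_tss = [
    ("chr6", 1_200_000),
    ("chr6", 4_999_999),
    ("chr6", 5_000_000),
    ("chr19", 12_000_000),
]
per_bin = bin_counts(toy_tss)
fine_bins = bin_counts(toy_tss, bin_size=1_000_000)  # 1 Mb bins for refined analysis
```

Running the same function over the SGPs and over the retinome transcripts yields two per-bin count tables that can be compared bin by bin.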
Mapped loci of retinal dystrophies with unknown genetic basis (n = 45) were taken from RetNet, January 2004 [58] and placed on the human genome sequence by querying the UCSC Genome Table Browser with DNA marker sequences shown to flank the minimal candidate region. Three disease loci (CORD1, CORD4 and RCD1) are insufficiently mapped on the respective human chromosomes and were therefore not included in the analysis.

Statistical analysis of gene distribution
To determine if either of the two datasets, the 43,109 human non-redundant SGPs and the 13K retinome transcripts, is distributed in a non-parametric and distribution-free manner over the genome, the Kolmogorov-Smirnov Goodness-of-Fit Test was used [65]. Statistical significance of the median difference in paired chromosomal distribution of retinome transcripts versus the SGPs was then evaluated by the non-parametric Wilcoxon two-sample paired signed rank test [66]. To carry out the test we calculated the difference between all genes versus retinal genes per 5-Mb bin. To correct for the total number of genes within the two groups, the SGPs per bin were adjusted by a factor of 13,037/43,109 = 0.30.
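The scaling and paired comparison described above can be sketched in plain Python. The code below computes only the signed-rank sums, not a p-value; the function name and the toy bin counts are hypothetical, and a statistics package (for instance `scipy.stats.wilcoxon`) would normally be used to obtain the significance level.

```python
SCALE = 13_037 / 43_109  # about 0.30, the correction factor given in the text


def signed_rank_sums(all_genes_per_bin, retinal_genes_per_bin, scale=SCALE):
    """Return (W_plus, W_minus) for scaled all-gene counts minus retinal counts."""
    diffs = [a * scale - r for a, r in zip(all_genes_per_bin, retinal_genes_per_bin)]
    diffs = [d for d in diffs if d != 0]  # zero differences are dropped
    ordered = sorted(diffs, key=abs)      # rank by absolute difference
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1    # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical per-bin counts; scale=1.0 keeps the toy arithmetic exact
toy_all = [10, 20, 30, 40]
toy_retinal = [8, 25, 29, 40]
w_plus, w_minus = signed_rank_sums(toy_all, toy_retinal, scale=1.0)
```

With the default `scale`, each SGP bin count is multiplied by roughly 0.30 before the difference is taken, so that bins are compared as if both gene sets had the same total size.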