Comprehensive in silico functional specification of mouse retina transcripts

Background The retina is a well-defined portion of the central nervous system (CNS) that has been used as a model for CNS development and function studies. The full specification of transcripts in an individual tissue or cell type, like retina, can greatly aid the understanding of the control of cell differentiation and cell function. In this study, we have integrated computational bioinformatics and microarray experimental approaches to classify the tissue specificity and developmental distribution of mouse retina transcripts. Results We have classified a set of retina-specific genes using sequence-based screening integrated with computational and retina tissue-specific microarray approaches. 33,737 non-redundant sequences were identified as retina transcript clusters (RTCs) from more than 81,000 mouse retina ESTs. We estimate that about 19,000 to 20,000 genes might express in mouse retina from embryonic to adult stages. 39.1% of the RTCs are not covered by 60,770 RIKEN full-length cDNAs. Through comparison with 2 million mouse ESTs, spectra of neural, retinal, late-generated retinal, and photoreceptor -enriched RTCs have been generated. More than 70% of these RTCs have data from biological experiments confirming their tissue-specific expression pattern. The highest-grade retina-enriched pool covered almost all the known genes encoding proteins involved in photo-transduction. Conclusion This study provides a comprehensive mouse retina transcript profile for further gene discovery in retina and suggests that tissue-specific transcripts contribute substantially to the whole transcriptome.


Background
The retina is a well-defined portion of the Central Nervous System (CNS) that has long been used as a model for CNS development and function [1][2][3][4]. It is susceptible to a variety of diseases that can lead to vision loss or complete blindness. Most of the unique functions of the retina depend upon its tissue-specific transcript sets, suggesting that a systematic definition of retinal transcripts would be an invaluable approach to understanding retinal cell identities and functions.
The complete genome sequences of human and mouse provide a new starting point for understanding specific expressed transcripts, especially the sequences associated with development and disease. Expressed sequence tag (ESTs) databases, are the most abundant resource of gene expression data. Recently, Okazaki et al. used available data on ESTs to establish a comprehensive full-length transcript data base [5]. There have been a number of studies of retinal transcripts [6][7][8][9][10] and initial databases listing some of the retinal transcripts ( [11]; http:// neibank.nei.nih.gov/index.shtml; [12]; http:// www.umich.edu/~igene/), but none of these have provided a global view of retinal transcripts. Recently, Blackshaw et al used serial analysis of gene expression (SAGE) identified 1,051 genes that showed developmentally dynamic expression in mouse retina [13]. Schulz and coworkers analyzed a set of retina transcripts from a mixed population of different datasets and suggested that about 13,000 transcripts might describe 90% of the adult retinome [14]. Although a systemic analysis of mouse retina ESTs has recently been reported [15,16], the functional specification of retina transcripts has not. Thus, this study provides a complementary view of mouse retina transcripts.
In the present study, we have generated mouse retina ESTs from embryonic day 13.5 (E13.5), postnatal day 1 (PN1), and adult (8 weeks old) and then analyzed about 81,000 ESTs along with other three major mouse retina libraries through an approach that integrates computational and retina tissue-specific microarray data to identify a set of candidate genes highly related to retina-specific function and retinal diseases. 33,737 non-redundant sequences were identified as retina transcript clusters (RTCs) and step by step classified into neural, retinal, late-generated retinal, and photoreceptor-enriched RTCs. This study also provides a comprehensive table of mouse retina transcript profiles that will now allow a better understanding of retinal development and function.

Purification and specification of mouse RTCs
We used a series of computational steps (Fig. 1, Additional data file 1 and 2) to clean up and reorganize a total of 81,253 mouse retina ESTs from the NCBI database (October 2002). The starting pool of retina ESTs was generated from the total retina ESTs by subtraction of a population (4,848 or 5.9% of the total) containing repeat sequences, fusion sequences, low sequence quality, vector sequences, very short sequences, mitochondrial sequences, and sequences with no BLAST-hit in the mouse genome (data not shown). 33,737 ESTs from this starting pool of 76,467 were identified by applying a program that searched for non-redundant sequences ( Fig. 2A). We have termed these Retina Transcript Clusters (RTCs). Figure 1 Scheme of computational and biological procedures. For computational analysis, five mouse EST sets were used (retina, cerebellum, lung, CNS, and whole EST). All ESTs were filtered through a non-redundant procedure (details in additional data file 1 and 2). The whole set of RTCs representing all retina transcripts was called RTC-All (RTC-A). Then RTC-A were subtracted with CNS and whole mouse ESTs to generate RTC-N and RCT-R pool. Cleaned cerebellum and lung EST pool were used as internal control. In microarray assays, tissues from PN21 mouse retina, brain, and other body regions were used for comparison of gene expression to verify RTC-R. Gene expression profiles from PN21/PN1 retina comparisons represented a set of genes involved in late-born retina development. By comparison of this set with RTC-R, RTC related to late-born retina cell development were generated (RTC-L). PN35 wild-type retina was used to compare with rd1 mutant retina at same age. Results from this subtraction represent a set of genes whose expression is associated with rod photoreceptors (RTC-P).  Fig. 2A). The whole set of RTCs was further analyzed by comparing their sequences (BLAST score > = 100) with 60,770 RIKEN mouse full-length cDNAs derived from multiple tissues [5]. 39.1% of the RTCs were not present in the RIKEN mouse full-length pool, indicating the existence of a substantial population with either unique full-length or uniquely spliced transcripts in retina.

Scheme of computational and biological procedures
To find criteria by which we could define tissue-enriched or tissue-specific transcript clusters we carried out a comparative analysis starting with approximately 106,000 lung and 80,000 cerebellum ESTs. These sequence pools were filtered using the same procedures as for retina to give transcript clusters (TCs) for each tissue. We then collected all the sets of mouse EST data and removed from them any set that would contain a contribution from retina, lung or cerebellum. This gave a set of data containing over 2 million ESTs. We then used this total EST dataset and the individual transcript clusters (TCs) to derive a ratio for the number of copies in each TC compared with Purification and classification of mouse retina ESTs the number in the total. Any EST in a TC that was not found in the total was set to a value of 100. An EST with 5 copies in a TC and 5 copies in the total would have a ratio of 1 and an EST with 10 copies in a TC and 5 in the total would have a ratio of 2. This analysis was carried out for each of the three TCs. We then asked at what ratio were ESTs in the RTC not found in either the LTC or the CTC using the subset of sequences for which fell into UniGene clusters. At a ratio of 0.3 and above only 1.79% of RTC were shared with lung and only 3 % with cerebellum ( Fig  2B). By increasing the ratio to 2, the overlap with lung fell to 0% and with cerebellum fell to 0.3%.
We have also examined abundance in the EST pool as a criterion for helping define tissue specific transcripts. As a reference for the validity of these criteria we tested the inclusion of a known set of 47 known photoreceptor genes. Using an enrichment ratio of 2 from the above analysis, we found that 70-80% of these known photoreceptor specific genes were included in the RTCs. By increasing the copy number to 2 or more or to 3 or more, the total pool of RTCs decreased from 9,179 to 2,629 and 826 respectively without any substantial loss of detection of the known photoreceptor specific genes ( Fig 2C). The pool of 826 RTCs is listed in Additional data file 3.
We also carried out this analysis by setting the copy number to > 3 and examining the effect of varying the enrichment ratios (Fig 2D). At this copy number, increasing the enrichment ratio above 0.5 only changed the coverage and total number of ESTs by small amounts. Trend tests of the reduction of RTC numbers associated with increased EST copy number were significant using photoreceptor related genes (P = 6.8 × 10 -59 ) as criteria. Together our analysis suggests that about 80% of tissuespecific or -enriched transcripts can be identified using the two criteria, a copy ratio of at least two in the specific tissue and at least three EST copies.

Biological approach in identification of enriched retina transcripts
We also tested the biological robustness of the computational data using mouse retina tissue-specific microarrays (9,216 spots and 7,612 UniGene clusters) for experimental confirmation. Two groups of experiments were designed to detect genes enriched in retina or photoreceptors by microarray analysis. In the first group, total RNA from postnatal day 21 (PN21) retina was isolated and compared with RNA from whole brain or pooled RNA from other organs, including heart, lung, spleen, liver, and kidney from the same animals, (designated as "body" in this study). The data from Retina/Body, Retina/Brain were analyzed individually and log2 ratio results of the two experiments are shown as a spot-plot graph (Fig. 3A). The Retina/Body ratio is represented along the Y-axis and the Retina/Brain ratio along the X-axis. When setting log2 > = 1 as positive ratio, a double positive population lying at the upper-right represents retina-enriched genes. Using the same protocol, for the second group of experiments we compared RNA from PN21 retina with RNA from PN1 retina as set 1 (PN21retina/PN1retina) and PN35 wildtype retina with PN35 rd-1 (photoreceptor-deficient) retina as set 2 (WTretina/rd1retina). In set 1 we are comparing retinas before and after the generation of rod photoreceptors and in set 2 we were comparing wild type retinas with retinas in which the rods had degenerated. Thus, when setting log2 > = 1 as positive ratio, a double positive gene population lay at the upper-right and was considered as rod-enriched genes (Fig. 3B). The microarray results were used to generate two gene-expression profiles specific for later born neurons of retina and photoreceptors, respectively (Additional data file 4).
Computationally enriched ESTs from retina (RTC-R), subtracted with whole body and neuron ESTs, were compared with the microarray data from Retina/Body and Retina/ Brain. The overlap between the two approaches was 70% under the microarray criterion of log2 ratio > = 1 (Fig. 3C, 316 RTCs, P = 6.1 × 10 -5 ) and 80% using the criterion of 1.5 (Fig. 3D, 148 RTCs, P = 10 -4 ). The percentage of overlap increased by increasing the EST copy number ( Fig. 3C and 3D), indicating that the accuracy of the enriched detection in specific tissues depends on the EST copy numbers, reflective of mRNA abundance.  Figure 4A. RTCs that were present in the Gene Ontology (GO) database [18] were used for functional clustering. As shown in Figure 4B within the Biological Process category, a significant population (10% to 20%) in the subcategory Response to External Stimulus is found in RTC-L and RTC-P. The majority of RTCs in this category are the genes related to light response.

Functional clusters of retina enriched RTCs
To determine whether the enriched TCs of different tissues contained different functional clusters, we analyzed the EST populations of lung, cerebellum, and retina that are in the GO database. About 45 functional clusters from the Biological Process and Molecular Function categories of the GO database were analyzed before subtraction and about 30 functional clusters were analyzed after subtraction and the results shown in radar graphs. Interestingly, before subtraction, a similar distribution of functional clusters appeared among the three libraries in either Biological Process (Fig. 4C) or Molecular Function (data not shown) categories. After subtraction, using the criteria of > 3 EST copies and an enrichment ratio of 2, the distribution of functional clusters shifted in specific directions in Biological Process (Fig. 4D, Table 1) categories. This suggests that using subtracted subsets for classification of ESTs in individual tissues can provide accurate information about functional specificities.
We also looked at distinct functional profiles at an early phase (embryo) including E13.5, E14.5, and PN1 and an adult phase, using the criteria of > 3 copies RTCs and 83% (ratio > 5) enrichment in one phase compared to the other (Fig 4E and 4F). We found, that in retina-enriched RTC-pools (826 RTCs, > 3 EST copies, 2 fold enrichment) the majority of the RTCs were found in both phases (495 RTCs, 59.9%). 34.4% (284 RTCs) were found only in the adult phase, and 5.7% (47 RTCs) were found only in the embryo phase.
To examine the different functional clusters in these two phases, categories of the GO database were analyzed. In the biological process category, 18% of RTCs belonged to Cell Organization and Biogenesis clusters and 12% to Cell Cycle clusters in the embryo phase compared with 6% and 5%, respectively, in the adult phase. Conversely, about 24% of RTCs were in the Transport clusters in the adult The percentage (bar-graph) of RTCs overlapping between computational assay (RTC-R) and microarray assay at distinct microarray ratio, log2 = 1 (C) and log2 = 1.5 (D). Line represents the numbers of RTCs. X-axis is copy number in same RTCs.

Microarray approach and confirmation of retina enriched RTCs
GO analysis before and after RTC subtraction phase and only 8.5% in the embryo phase (Fig. 4E). Although there was the same percentage in Metabolism clusters (Fig. 4E) in both the embryo (55%) and adult (56%) phases, detailed analysis showed that about 17% of RTCs were in Biosynthesis clusters in the embryo phase compared with 7.7% in the adult phase (Fig. 4F). Interestingly, Lipid metabolism (5.8%) and Energy pathways (3.9%) were significantly higher in adult phase and in embryo phase, respectively. As expected, overall comparison of these phases shows a change from functions characteristic of a proliferating epithelium (such as cell cycle) to those characteristic of a mature retina (such as transport).

A comprehensive transcript profile of mouse retina
The computational processes described above classified a total of 33,737 RTCs for their statistical probability of being retina specific or retina-enriched, as listed in Additional data file 5. Because of the possibility of alternative splicing, we did not combine all UniGene clusters together and kept the non-redundant sequences as our cluster units. Thus, there may be some gene redundancy in the 33,737 RTCs. If we ignore this potential redundancy among the RTCs, 25,673 RTCs can be clustered into 14,618 UniGene clusters (57%) and 8,064 RTCs have not yet been clustered. If we assume the same ratio of RTCs to clusters for un-clustered RTCs, we estimate that about 19,000 to 20,000 genes might expressed in mouse retina from embryonic to adult stages.
Seven sets of information are provided in the additional data file 5. First, a set of basic information including RTC I.D. numbers, Genbank I.D. for reference sequences, locus    [6], and seventh, human retina transcript information from RetBase [7].
Using the RTC database, we can identify new genes specifically expressed or enriched in the retina for which there is not yet any biological evidence. For example, a total of 37 known genes with homeodomains from the RTC pool are listed in Table 2. Some of the genes have been well studied like Crx [19,20], Rax [21], Otx2 [22,23], and Prox1 [24]. Most of these genes, however, have not previously been described in retina, such as Og9x [25], Lhx [26,27], and Onecut [28,29]. Three of the sine oculis (so)/Six family of genes were present in the RTC pool (Fig. 5A). ESTs of Six3 and Six6 were 60% to 80% enriched in retina. The patterns have been confirmed by biological studies of expression level [30] and function analysis [31][32][33] of these genes. Ten members of forkhead/winged helix family appeared in our RNA pools and only one, Foxn4 is highly concentrated in the retina (Fig. 5B). This observation has also been confirmed by a recent study [34]. The genes from those two gene families are highly enriched in the embryo phase. Lhx3 and Lhx4, from the LIM homeodomain gene family, are highly enriched in the adult retina (Fig. 5C). Similarly, genes such as those encoding members of the guanine nucleotide binding protein family, the ATP-binding cassette protein family, the voltagedependent calcium channel protein family, and the potassium voltage-gated channel protein family (additional data file 6) are also present in the retina enriched pool. We do not have biological evidence for all the genes that have been listed in this comprehensive profile of RNAs expressed in retina, yet our gene-tables can be useful tools for analyzing mouse retina transcripts and can provide an overview of genes involved in function and development of retina.
From the RTC database, we can also extract identify candidate genes for human retinal diseases. As shown in Figure  2C, using the criteria of enrichment ratio > 2 and at least 3 EST copies in RTC, 826 RTCs were identified as a retina enriched gene pool. By homology with the human genome, this RTC pool covers about 80% of knownphotoreceptor related genes (29 out of 37, Fig. 2C and  2D). Through comparison with individual interval gene numbers, numbers of candidate RTCs are concentrated to 1.5% to 0.05% (additional data file 7). A full list of known human retina disease genes and the subtraction information are shown in additional data file 8. Among retina-dis- ease-related genes, almost all photoreceptor related genes especially the genes involved in phototransduction processes are covered by our most stringent subtraction and selection criteria (> 3 EST copies and 2 fold enrichment). However, genes related to systemic diseases or syndromes, genes involved in RNA processing and genes with lower copy numbers in RTCs are not present under these criteria, indicating a limitation of this method. Although candidate genes from the 826 RTCs pool can cover over 41 of 48 mapped human retina disease loci, here we provide only the genes for 27 loci that have been more stringently associated with primarily photoreceptor disorders (additional data file 7 and 9).

Distribution of enriched RTCs in gene families
Comparison with other approaches for retina enriched gene subtraction shows obvious gaps among the various experimental approaches (Table 3). Here we have compared our enriched gene pool with the pools from SAGE [6] and RetBase [7]. Only genes clustered by UniGene are used using two criteria from SAGE pool with two standards [6], 2 out of 4 (71 genes) and SAGE 264 (264 genes) respectively; one criteria from Retbase (373 genes, [7]); and one criteria with at least 3 EST copies and 2 fold enrichment in RTCs (184 genes). The highest 22% overlaps between our and SAGE pools and the lowest 5% overlaps between Retbase and SAGE pools.

Discussion
The full specification of transcripts in an individual tissue or cell type can greatly aid the understanding of the control of cell differentiation and cell function. In the present study, we have integrated computational bioinformatics and microarray experimental approaches to classify the tissue specificity and developmental distribution of mouse retina transcripts.
We have defined 33,737 retina transcript clusters (RTCs) as single units with non-redundant sequences, although multiple transcript clusters can be in the same UniGene cluster. Such RTCs may represent different parts of a gene, or splice variants that are not considered in the UniGene database. We have calculated that about 19,000 to 20,000 genes may be expressed during mouse retina development from embryonic day 13.5 to adult. This is about 30% more compared with Schulz's Retinome (13,037 genes) [14], although this was restricted to transcripts in the adult phase. Since the nervous system, including the retina, has more prolific RNA splicing (2.5 fold higher that other tissues) [35], we suggest that the numbers of unique transcripts found in retina may well be greater than 20,000. More interestingly, since 39.1% of the RTCs are not included in the most comprehensive mouse fulllength transcript set made by RIKEN [5], tissue-specific transcripts and splicing may constitute a substantial proportion of the whole transcriptome. On the other hand, there are large numbers of ESTs (about 16,000) that do not appear to encode proteins. Many of these may serve as regulatory RNA or have other unknown functions and need further study [5,13].
A major concern for specification of transcripts from ESTs is how to confirm a reliable result. We have used four control steps to verify the results. First, as an external negative control we compared the specificity in our target tissue, retina, and in other tissues (lung and cerebellum). Second, we used an internal positive control. We used known photoreceptor-specific genes to check their coverage rate during different steps of retina specification. The third control was to identify functional clusters through the Gene Ontology (GO) database before and after specification of ESTs in different tissues. Biological confirmation was the final control experiment in this study and used a mouse retina tissue-specific microarray. Results from all the control experiments fully support the conclusions of this study.
Identification of retina disease genes is an important and immediate use for genome-wide study tools. Several approaches have been used for genome-wide hunting of such genes. However, as shown in our results, the data sets show obvious gaps among these various experimental approaches (  found a total of 925 ESTs likely to be specifically or preferentially expressed in the retina [7]. We found a low overlap between RetBase and our enriched pools. In part this is because of the smaller starting sample, the majority of human ESTs are generated from tumors and the emphasis on single copy sequences in the RetBase set. A SAGE analysis identified 264 uncharacterized genes that were specific to or highly enriched in rods [6][7][8][9]. This data set showed more overlap with our results but there are, still obvious gaps among these various experimental approaches, so we suggest that integrating different approaches might be much more valuable for tissue specific and enriched gene prediction.
Generation and specification of an entire transcription profile for individual tissues or cells with specific functions and morphologic identities represents the next major task in the genome era. In conclusion, this study complements and extends previous studies in a number of ways. First, we have generated a comprehensive data set of retina transcript profiles with functional and developmental explanation, as the examples shown in Figure 4 and Table 2, diverse function information can be generated through our RTC profiles that will now allow a better understanding of retinal development and function. It is not only for hunting retina disease genes, but also for understanding of gene developmental distribution. Second, this study has classified retina transcripts into different grades of retinal specificities that will help us define them as common, neural specific or retinal specific genes. Third, the distinct approaches of this study will allow an easy updating of our mouse retina transcripts databases in future.

EST resources and manipulation
Mouse retina ESTs were collected from NCBI and were filtered and cleaned up by a series of programs (Figures 1, S1 and S2). Two Dell Precision WorkStation 530's running RedHat 7.0 Linux were used for computational processes. The database used was MySQL and the languages for programming were Python, GNU C, awk, and bash. Repeat detection used RepeatMasker http://repeatmas ker.genome.washington.edu/cgi-bin/RepeatMasker (kindly provided by A. Smith and P. Green). Other analyses used custom programs.

Mice and retina sample collection
Mice were purchased from the Jackson Laboratory (C57Bl/6j) or were a gift from Dr. C. Zeiss, Yale University (C3H wt and rd1). Retinas were dissected without contamination from lens, iris, cornea, and ciliary body. 10 to 20 retinas or other organs were pooled for RNA isolation. All animal experiments were conducted in accordance with NIH guidelines and were approved by the IACUC of Yale University School of Medicine.

RNA preparation
Total RNA was isolated by TRIzol (Invitrogen) and purified by RNeasy mini kit (QIAGEN). 5 µg total RNA with 280/260 ratios greater than 1.9 was used for array hybridization without amplification. Three to four sets of RNA were prepared from each age of retina and processed individually for microarray analysis.

Microarray experiments
About 12,000 non-redundant mouse retina ESTs were generated from about 28,000 ESTs that generated from E13.5, PN1 and adult (The NIH-University of Iowa Brain Molecular Anatomy Project). 9,216 purified PCR-amplified inserts were printed by the Yale Keck Microarray Core on poly L-Lysine (Sigma) coated glass slides utilizing a GeneMachines Omnigrid robotic arrayer (GeneMachines). 3DNA Submicro EX Expression Array Detection Kits (Genisphere, PA) were used for RNA labeling. Detailed microarray experimental protocols are shown in supplemental methods. Slides were scanned on a GenePix 4000B scanner and the data were manipulated with Gene-Pix software Version 4.0 (Axon Instruments). Three or four sets of microarray data for each experiment were used for Student t test and gene collection. Gene collection methods are described in text and also please see additional data file 10 for methods detail.

Statistics
To test whether the features illustrated in Figures 2 and 3 show an increasing trend in terms of covering photoreceptor related genes and retina disease genes through more stringent filtering criteria, we conducted trend tests in the following form: , where C is the number of classes above the baseline total population, w i is the weight for the ith class, and y i is the observed percentage related to the feature of interest in the ith class.
The value of C = 2, 4, 3, and 3, for figures 2C, 2D, 3C, and 3D, respectively. The value of y i is the proportion of the ESTs with a given feature in the ith class, with the feature being either a photoreceptor related gene or a retina disease gene. The weights are 1, 2, ..., C for the C classes.
To assess the statistical evidence of a trend in the data based on T, we calculated the mean and variance of T under the null hypothesis of no trend conditional on the feature distribution in the baseline total population. It can be shown that when the feature of interest is binary, i.e. a given EST either has or does not have this feature, where Φ is the cumulative function of the standard normal distribution. Statistic calculations were done by R