Sequencing and comparative genomic analysis of 1227 Felis catus cDNA sequences enriched for developmental, clinical and nutritional phenotypes

Background The feline genome is valuable to the veterinary and model organism genomics communities because the cat is an obligate carnivore and a model for endangered felids. The initial public release of the Felis catus genome assembly provided a framework for investigating the genomic basis of feline biology. However, the entire set of protein coding genes has not been elucidated. Results We identified and characterized 1227 protein coding feline sequences, of which 913 map to public sequences and 314 are novel. These sequences have been deposited into NCBI's GenBank database and complement public genomic resources by providing additional protein coding sequences that fill in some of the gaps in the feline genome assembly. Through functional and comparative genomic analyses, we gained an understanding of the role of these sequences in feline development, nutrition and health. Specifically, we identified 104 orthologs of human genes associated with Mendelian disorders. We detected negative selection within sequences with gene ontology annotations associated with intracellular trafficking, cytoskeleton and muscle functions. We detected relatively less negative selection on protein sequences encoding extracellular networks, apoptotic pathways and mitochondrial gene ontology annotations. Additionally, we characterized feline cDNA sequences that have mouse orthologs associated with clinical, nutritional and developmental phenotypes. Together, this analysis provides an overview of the value of our cDNA sequences and enhances our understanding of how the feline genome is similar to, and different from, other mammalian genomes. Conclusions The cDNA sequences reported here expand existing feline genomic resources by providing high-quality sequences annotated with functional, clinical, nutritional and orthologous gene information derived from comparative genomics.


Background
The domestic cat, Felis catus, is a member of the family Felidae and represents the Feliformia branch of the order Carnivora [1]. The domestic cat is an important companion animal and veterinary species. There are roughly 82 million companion cats living in more than 35 million US households [2]. The domestic cat also has substantial value as a model organism for comparative mammalian genomics because it is an obligate carnivore [3], unlike the dog, which is an omnivore [4]. Additionally, the domestic cat is an important model organism for Felidae because of its close phylogenetic relationship to the wildcat (Felis silvestris), the sand cat (Felis margarita), the black-footed cat (Felis nigripes) and the jungle cat (Felis chaus). It can also serve as a model for more distantly related felid species including pumas such as the cheetah (Acinonyx jubatus), lynx species, ocelots [5,6], and members of Panthera including the lion (Panthera leo), the tiger (Panthera tigris), and the snow leopard (Uncia uncia) [7]. A major goal of feline genomics is to identify and decode both cat-specific biology and conserved mammalian biology. The identification of feline-specific biochemistry and physiology is required in order to better understand the unique nutritional and veterinary needs of cats and to enhance the wellness of domestic cats as well as the health and management of captive felid species.
A number of cat-specific biological adaptations have been described to date. Cats exhibit a variety of evolutionary adaptations thought to be associated with their predatory behaviour and obligate carnivore status. For example, domestic cats exhibit distinct distal forelimb anatomical adaptations associated with predation [8,9], as well as sensory adaptations in both sound perception [10][11][12] and visual acuity [13,14]. At a molecular level, cats exhibit differences in the regulation of sugar transporters [15] resulting in lower liver glucose transporter activity [16] and differences in carbohydrate metabolism compared to omnivores [17]. Because the carnivore diet is relatively high in amino acid content, adult cats maintain blood glucose levels from gluconeogenesis of glucogenic amino acids, lactic acid and glycerol [18]. Compared to omnivorous mammals, in which gluconeogenesis occurs in the post absorptive state, cats exhibit the greatest extent of gluconeogenesis right after a meal during the absorptive state [19].
Amino acid biosynthesis and deficiency have been relatively well studied in domestic cats. Cats have dietary requirements for the amino acids taurine [20], arginine [21], cysteine and methionine [22]. Arginine deficiency in cats has been associated with rapid onset of hyperammonemia characterized by severe signs of ammonia toxicity [23]. The sulphur containing amino acids cysteine and methionine are normally present in high amounts in animal flesh and are required for normal feline development [24,25]. The beta-amino sulfonic acid taurine is required in cats because, unlike many other species, which can conjugate bile acids to either glycine or taurine for secretion of bile salts into bile, cats can only use taurine. Unlike dogs, cats have evolved limited capacity to synthesize taurine [26]; consequently, taurine deficiency in cats is associated with abnormal cardiac [27], immune [28], neurological [29], platelet [30], reproductive [31] and retinal [32] function. The recent description of the taurine transporter knock out mouse underscores the biological roles of taurine in mammals [33].
Although many aspects of feline-specific biology have been elucidated to date, bioinformatics methods and comparative genomics approaches can provide a mechanism for producing a number of plausible and useful biological hypotheses from feline cDNA sequences.
The 2007 release of the feline genome [34] marked the beginning of the feline genomics era. It was followed by the identification of close to 1 million single nucleotide polymorphisms across cat breeds [35], which further extends the repertoire of genomic tools for investigating the genomic basis of feline phenotypes. In this paper, we describe the sequencing of additional feline cDNA sequences and demonstrate the utility of employing comparative genomics methods to investigate not only the roles of these cDNA sequences, but also the extent to which these feline sequences diverge from other mammalian orthologous sequences.
Our working hypothesis is that conservation among human, mouse, dog and cat orthologs underscores conserved mammalian biology while feline sequence divergence among mammalian orthologs provides potential insight into cat-specific biology. Specifically, we employ a computational comparative gene expression analysis to map the cDNA sequences to anatomical information, developmental timelines, cells and pathology terms. Additionally, we utilize the gene ontology annotation, in combination with measures of synonymous and nonsynonymous differences in orthologous protein sequences, to better understand which of the cDNA sequences are likely to represent conserved mammalian biology and which are more likely to represent feline-specific biology. We organize these results into biological processes, cellular localization and molecular function in order to more easily interpret the results. Finally, we map these feline cDNA sequences to orthologs in other species in order to identify (1) phenotypes, (2) biochemical pathways and (3) human diseases in an attempt to better understand the roles of these cDNA sequences in feline development, nutrition and disease.

Results
Sequencing and Orthologue Identification
1227 high quality feline cDNA sequences were identified from a starting set of 3035 cDNA sequences (Figure 1). Total RNA was purified from 21 feline tissues (brain, kidney medulla/cortex, spleen, heart, liver, lung, skeletal muscle, thyroid gland, lymph node, pancreas, adrenal gland, tongue, colon, mammary gland, neonatal thymus, brain and testes) collected from 10 domestic shorthaired cats post-mortem, three cell lines derived from kidney, brain, lung, and 1 tissue pool using standard procedures. The initial set of 3035 cDNA sequences was assembled from the sequencing reads from tissue specific cDNA libraries. These sequences were designated full length because they corresponded to the complete length of assembled sequencing reads. These sequences were translated to produce protein sequences and clustered in nucleotide space and protein space to identify a set of non-redundant full length sequences. The clustering produced 3028 nucleotide clusters and 2834 protein clusters. The intersection of these two sequence sets was used to produce the final set of 2831 clustered full length sequences. The set of clustered sequences was filtered to remove sequences containing non-nucleotide and non-amino acid letters, which resulted in a set of 2081 high quality non-redundant full length sequences.
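The character filter described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the record layout `(identifier, cdna, protein)`, the function names and the two alphabets are assumptions made for illustration, and the clustering step is not reproduced.

```python
# Hypothetical alphabets: unambiguous nucleotides and the 20 standard amino acids.
NUCLEOTIDES = set("ACGT")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_clean(seq, alphabet):
    """Return True when seq is non-empty and uses only letters from alphabet."""
    return bool(seq) and set(seq.upper()) <= alphabet


def filter_high_quality(records):
    """Keep records whose cDNA and protein contain no ambiguous letters.

    records: iterable of (identifier, cdna, protein) tuples (assumed layout).
    """
    return [
        (ident, cdna, protein)
        for ident, cdna, protein in records
        if is_clean(cdna, NUCLEOTIDES) and is_clean(protein, AMINO_ACIDS)
    ]
```

A record whose cDNA contains an N, or whose translation contains an X, is dropped; everything else passes through unchanged.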
For the set of 2081 cDNA sequences, the shortest and longest sequences were 353 and 4750 nucleotides respectively. The average nucleotide length was 1349 nucleotides with a standard deviation of 567 nucleotides. The 2081 protein sequence set exhibited a shortest and longest sequence of 41 and 1128 amino acids respectively. The average protein sequence length was 279 amino acids with a standard deviation of 149 amino acids.
This set of sequences was searched with BLAST against the set of known human cDNA and protein sequences to identify the best human match (see Figure 1). Additionally, these 2081 cDNA sequences were searched with BLAST against known and ab initio feline cDNA and protein sequences from Ensembl [36] to identify sequences for which public feline sequence data exists. Subsequently, these sequences were aligned using a global alignment algorithm to remove sequences for which the best BLAST hit represented only local homology. After manual review of all of the global nucleotide and protein alignments, a set of 1227 non-redundant feline sequences was selected as high confidence, high quality feline sequences. The set of 1227 sequences comprised 913 known sequences and 314 novel sequences, of which 914 were successfully mapped to their corresponding dog, human and mouse orthologs. Although additional non-redundant feline cDNA sequences we identified mapped to three or fewer orthologs across the four species, we limited our subsequent analysis to only those sequences for which all three non-feline species orthologs were confidently identified. This decision was made to ensure that our functional and comparative analysis would include only feline cDNA sequences for which dog, mouse and human orthologs were identified. Of the 914 orthologous sequences, 844 corresponded to known feline sequences and 70 corresponded to novel sequences (see Figure 1). Additional file 1, Table S1 contains the complete set of 1227 non-redundant nucleotide and protein sequences. The complete set of 914 orthologous sequences is listed in Additional file 2, Table S2 along with the designation of known or novel and the corresponding Ensembl gene, transcript and protein identifiers for the dog, human and mouse orthologs.
Compared to the existing public feline sequences, the sequences we identified exhibited a trend toward longer length and fewer sequencing errors. For example, of the 913 sequences that correspond to known feline public sequences, 309 of the public sequences contain a non-nucleotide sequence character such as an N or an X. Within those public sequences containing N's or X's, 292 are shorter than the corresponding sequence we identified and only 17 are longer. Within the set of 604 public sequences mapped to our known sequences that do not contain N's or X's, 597 are shorter than the feline sequence we identified and only 7 are longer. Figure 2 shows the distribution of nucleotide and protein sequence lengths for our set of 1227 sequences.

Comparative Gene Expression Analysis
The sequences we report were obtained from extensive sequencing of 21 individual tissue cDNA libraries and 1 pooled cDNA library. It is well known that while some genes may exhibit rather narrow ranges of expression across tissues and cell types, many genes exhibit expression across numerous tissues and cell types [37,38]. We chose to leverage the orthologous relationships among our sequences to infer gene expression patterns across a set of anatomical regions.

Figure 1 An initial set of 3035 cDNA sequences was clustered in nucleotide and protein space to identify the longest representative sequence for each cluster. The intersection of the set of cDNA and protein clusters resulted in a set of 2831 cDNA sequence clusters. All sequences within this set that contained N's were removed, resulting in a set of 2081 high quality, non-redundant cDNA sequences. These sequences were searched with BLAST against (1) the set of Ensembl human known cDNA and protein sequences and (2) feline known cDNA and protein sequences. Global alignments were generated for each cDNA BLAST hit and manually inspected for quality. The final set of 1227 cDNA sequences corresponded to 913 known feline cDNA sequences and 314 novel feline sequences. BLAST searches against dog, human and mouse sequences identified a total of 914 orthologs, corresponding to 70 novel and 844 known sequences.
When considering the inferred gene expression patterns as a function of anatomical regions, we were able to identify 114 anatomical regions exhibiting expression of 766 genes encoding our sequences. Gene counts ranged from 1 gene each in lymph, rectum and cerebrum to 752 genes for the anatomical term lung. The eight anatomical terms exhibiting the lowest gene counts with more than a single gene include middle ear, corpus callosum and trachea (2 genes each, 0.21%), subthalamic nucleus and foreskin (3 genes each, 0.32%), epidermis and ciliary body (4 genes each, 0.44%) followed by adrenal medulla and internal ear (5 genes each, 0.54%). The eight anatomical terms exhibiting the greatest gene counts each contain at least 73% of the genes corresponding to our cDNA sequences. The top eight anatomical regions listed in ascending order are liver (668 genes, 73%), skin (676 genes, 74%), colon (686 genes, 75%), placenta (689 genes, 75%), kidney (693 genes, 76%), testis (703 genes, 77%), brain (725 genes, 79%) and lung (752 genes, 82%). Table 1 contains the anatomical gene expression annotation results.

The expression pattern annotation corresponding to cell type resulted in gene counts for 44 cell types, ranging from 1 gene (0.1%) each for brown adipose cell, platelet and eosinophil to 2 genes (0.22%) in mast cell and 3 genes (0.33%) in hepatocyte. A count of 13 genes (1.4%) was obtained for monocytes, while counts of 33 genes (3.7%) each were reported for both cardiac muscle cell and chondrocyte. At the other end of the expression spectrum, the terms with the greatest gene counts were stem cell (626 genes, 70.8%), B-lymphocyte (628 genes, 70.7%), epithelium (604 genes, 68.5%), retinal pigment epithelium (514 genes, 57%), skeletal muscle cell (499 genes, 56.6%), fibroblast (485 genes, 55%) and germ cell (435 genes, 49%). The cell expression counts provide cellular expression annotation for 749 of our orthologous genes. Table 2 contains the counts for all of the cell type expression annotations.
The mapping of pathology term expression annotation with our orthologous gene sets resulted in 57 terms having gene counts. The terms with the fewest gene counts included ulcerative colitis, neoplasia, rheumatoid arthritis, cirrhosis, and hyperplasia.

Taken together, these results suggest that the genes encoding the cDNA sequences we have identified exhibit considerably larger breadth of expression than would be suggested from the initial tissues that were sequenced. The broad extent of tissue, cell type, developmental and pathological expression annotation suggests that these sequences may include sequences underlying tissue and organ development as well as contributing to specific pathological conditions. In order to better understand the biological role of these genes we chose to combine the expression annotation with other functional and comparative annotation types.

Table 1 The anatomical expression pattern of the gene corresponding to each cDNA sequence was inferred. The human orthologs of each cDNA sequence were used to infer anatomical gene expression patterns using expression data (egenetic data) obtained from biomart. The results include 114 anatomical regions exhibiting expression of 766 genes encoding the cDNA sequences. The number of genes with inferred expression in each region is indicated (Number of Genes), as well as the percentage of genes with inferred expression in each region (% of Genes).

Table 2 The cell type expression pattern of the gene corresponding to each cDNA sequence was inferred. The human orthologs of each cDNA sequence were used to infer cell type gene expression patterns using expression data (egenetic data) obtained from biomart. The results include expression across 44 cell types. The number of genes with inferred expression in each cell type is indicated (Number of Genes), as well as the percentage of genes with inferred expression in each cell type (% of Genes).

The pathology expression pattern of the gene corresponding to each cDNA sequence was inferred. The human orthologs of each cDNA sequence were used to infer pathology gene expression patterns using expression data (egenetic data) obtained from biomart. The results include 57 pathology terms exhibiting expression of the cDNA sequences. The number of genes with inferred expression in each type of pathology is indicated (Number of Genes), as well as the percentage of genes with inferred expression in each type of pathology (% of Genes).

Gene Ontology Annotation Analysis
Gene ontology (GO) annotation was performed on the feline sequences using the previously identified comparative genomics ortholog relationships. Gene ontology terms were mapped from human annotation files to feline orthologs. The initial gene ontology human molecular function annotation file contained 73,467 function annotation terms mapped to 21,956 human gene identifiers, corresponding to 3,085 unique gene ontology function terms. The cellular location gene ontology annotation file contained 975 unique terms mapped to 21,956 human genes, resulting in 69,556 gene-term relationships. The biological process gene ontology annotation also contained 21,956 human gene identifiers, with 6518 unique gene ontology process annotation terms represented by 89,968 gene-to-GO entries. The mapping of gene ontology functional annotation terms onto the non-redundant full length sequences resulted in 901 of our feline cDNA sequences becoming associated with 647 unique gene ontology molecular function annotation terms, giving 3219 annotation-gene relationships. Repeating the procedure for the cellular location annotation, we mapped 3423 gene-annotation relationships corresponding to 337 unique location annotation terms covering the set of 901 genes. Mapping the biological process annotation terms produced 4247 gene-to-GO annotations, in which 1441 unique gene ontology process annotations mapped successfully to 901 genes.
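The transfer of annotation from human genes to feline orthologs, and the three counts reported for each ontology (annotated sequences, unique terms, gene-term relationships), can be sketched as follows. The input layouts and function names are hypothetical and the identifiers in any usage are placeholders; this illustrates the idea rather than the procedure actually used.

```python
from collections import defaultdict


def transfer_go_terms(cat_to_human, human_go):
    """Copy GO terms from each human gene to its feline ortholog.

    cat_to_human: dict mapping feline sequence id -> human gene id (assumed layout)
    human_go: iterable of (human gene id, GO term) pairs (assumed layout)
    Returns a dict mapping feline sequence id -> sorted list of GO terms.
    """
    terms_by_gene = defaultdict(set)
    for gene, term in human_go:
        terms_by_gene[gene].add(term)
    # Feline sequences whose human ortholog has no annotation are left out.
    return {
        cat_id: sorted(terms_by_gene[human_id])
        for cat_id, human_id in cat_to_human.items()
        if human_id in terms_by_gene
    }


def summarize(annotations):
    """Return (annotated sequences, unique terms, sequence-term relationships)."""
    unique_terms = {term for terms in annotations.values() for term in terms}
    relationships = sum(len(terms) for terms in annotations.values())
    return len(annotations), len(unique_terms), relationships
```

Running `summarize` once per ontology (function, location, process) would yield triples of the same kind as those reported in the text.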
Typically, gene ontology annotation terms are filtered using an enrichment criterion that is calculated from a hypergeometric null model to describe the number of annotation terms one might expect to occur within a gene set of a given size and a GO annotation distribution of particular parameters. Although such an approach is necessary when attempting to determine the biological role of a gene set, such as up-regulated or down-regulated genes in a gene expression study, we did not calculate an enrichment of gene ontology terms. Instead, we combined the gene ontology annotation with measures of evolutionary selection using non-synonymous (dN) versus synonymous (dS) codon statistics as a means of exploring the evolutionary relationships that exist among the different gene ontology annotations across our cDNA sequences. A well accepted approach for identifying evidence of positive selection is to identify genes exhibiting significantly larger rates of non-synonymous substitutions per non-synonymous site than synonymous substitutions per synonymous site. Evidence of fixation exists when the ratio of non-synonymous substitution rate to synonymous substitution rate equals zero (dN/dS = 0).
Evidence of negative selection exists when dN/dS < 1 and evidence of positive selection exists when dN/dS > 1. We recognize that using the dN/dS value across an entire gene is an extremely conservative measure of selection, and that smaller regions within a gene may exhibit local signals of positive selection [39]. However, we chose the conservative approach in order to minimize reporting false positives due to the possibility of sequencing errors.
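The interpretation of the gene-wide dN/dS ratio given above can be written as a small helper. This is only a sketch: the function name and labels are illustrative, and the "neutral" label for dN/dS = 1 is the conventional reading rather than a category stated in the text.

```python
def classify_selection(dn, ds):
    """Return (dN/dS, label) for one ortholog pair.

    dn, ds: non-synonymous and synonymous substitution rates.
    Labels follow the thresholds in the text; "neutral" (ratio exactly 1)
    is an added assumption.
    """
    if ds <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    omega = dn / ds
    if omega == 0:
        label = "fixation"
    elif omega < 1:
        label = "negative"
    elif omega > 1:
        label = "positive"
    else:
        label = "neutral"
    return omega, label
```

For example, a gene with dN = 0.05 and dS = 1.0 is labelled as being under negative selection.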
Instead of considering all of the genes we identified as a single gene set, we chose to select gene subsets using SQL queries in MySQL to identify cDNA sequences sharing gene ontology annotation terms, for which we calculated an average dN/dS value. From this analysis, we were able to identify annotation types exhibiting low dN/dS values, corresponding to greater levels of sequence conservation across species. We were also able to identify annotation terms that exhibited considerably higher dN/dS values, indicating that less negative selection acts on some types of genes. Because we chose to employ a stringent criterion for positive selection, we did not identify genes exhibiting strong signals of positive selection; instead, we were able to identify genes and annotation types with different levels of selection pressure acting on them. Beginning with the gene ontology location annotation, an SQL query was performed such that the genes exhibiting the same location annotation terms were grouped together and the average dN/dS value was calculated for cat versus dog, cat versus human and cat versus mouse. Location annotations occurring within gene sets that exhibit extremely low dN/dS values and very low standard deviation of the dN/dS value for each species were selected as negatively selected location annotation gene sets. A number of genes grouped by the same gene ontology location annotation terms exhibited dN/dS values close to zero (dN/dS < 0.07). These genes were associated with several cellular themes, each of which was associated with multiple location annotation terms. See Figure 3 for a representative map of gene ontology location annotation terms across the dN/dS values. The following terms related to microtubules and cytoskeletal organization occurred: microtubule associated complex (2 genes, dN/dS = 0), actin cytoskeleton (7 genes, dN/dS = 0.03), microtubule (9 genes, dN/dS = 0.05), cytoskeleton (32 genes, dN/dS = 0.05) and microtubule organizing center (6 genes, dN/dS = 0.06). A muscle theme was also present within the negatively selected location annotations. Muscle associated location terms included myofibril (2 genes, dN/dS = 0.02), Z disc (5 genes, dN/dS = 0.04), sarcomere (2 genes, dN/dS = 0.05) and muscle myosin complex (3 genes, dN/dS = 0.06). Additional location terms within this group included chromatin (3 genes, dN/dS = 0), nucleosome (4 genes, dN/dS = 0) and nuclear pore (3 genes, dN/dS = 0.06). The last theme observed within this group relates to intracellular trafficking and includes terms such as lysosomal membrane (2 genes, dN/dS = 0.01), golgi stack (3 genes, dN/dS = 0.01), trans golgi network transport vesicle (2 genes, dN/dS = 0.02), ER-Golgi intermediate compartment membrane (3 genes, dN/dS = 0.03) and SNARE complex (6 genes, dN/dS = 0.04).

The developmental expression pattern of the gene corresponding to each cDNA sequence was inferred. The human orthologs of each cDNA sequence were used to infer developmental stage by week of gestation, year of age and life stage of human development. The number and percentage of genes with inferred expression in each stage is indicated in the second and third columns respectively.
The themes observed in this data provide insight into the inner workings of the cell and shed light on the evolutionary constraints that act on different components of the intracellular machinery. The fact that these feline sequences include a distribution of gene products, some of which are strongly conserved across human, mouse and dog, suggests that these sequences include genes that play very important roles in critical cellular processes and correspond to conserved mammalian cellular biology. However, some genes map to protein products that have relatively less selective pressure acting on them. These gene products are also important because they represent the targets of adaptive evolution within the cell. While microtubule structure and function must be highly conserved, regulatory gene products are freer to evolve new interactions that may increase fitness of the cell. Figures 3 through 5 contain the three types of gene ontology annotation together with the average dN/dS values for genes exhibiting the same annotation types. Although this analysis of dN/dS values across our genes provided a gene level picture of our data, we wanted to investigate the large-scale pattern of dN/dS values across our cDNA sequences.
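The grouping of genes by shared annotation term, with an average dN/dS per species comparison, can be sketched with a single SQL `GROUP BY` query. The example below uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table name `go_dnds`, its columns and the function name are hypothetical; the queries actually run for this analysis are not given in the text.

```python
import sqlite3


def mean_dnds_by_term(rows):
    """Average dN/dS per (GO term, species comparison).

    rows: iterable of (gene, go_term, species, dnds) tuples (assumed layout).
    Returns a list of (go_term, species, n_genes, mean_dnds), lowest mean first.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE go_dnds (gene TEXT, go_term TEXT, species TEXT, dnds REAL)"
    )
    con.executemany("INSERT INTO go_dnds VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT go_term, species,
               COUNT(DISTINCT gene) AS n_genes,
               AVG(dnds) AS mean_dnds
        FROM go_dnds
        GROUP BY go_term, species
        ORDER BY mean_dnds, go_term, species
    """
    result = con.execute(query).fetchall()
    con.close()
    return result
```

Terms at the top of the returned list correspond to the strongly conserved groups, and terms at the bottom to groups under relatively less negative selection.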

GeneGO Analysis of Orthologous Genes by dN/dS Value
In order to gain a more global view of how the feline cDNA sequences compared to other species, a set of 711 cDNA sequences having orthologs containing gene ontology annotation across dog, mouse and human were analysed to detect any non-random patterns across the genes, species and annotations. We sorted a list of 711 genes by dN/dS value and identified 3 groupings, corresponding to the top 25% of dN/dS values, the bottom 25% of dN/dS values and the middle 50% of dN/dS values. Each list was used to query the GeneGO annotation database for metabolic pathways.
The GeneGO database is based on the data and annotation of the Gene Ontology (GO) consortium, which has collated biological annotations regarding the known or inferred roles of gene products, providing a powerful resource for identifying relationships among groups of genes and thereby allowing the expansion of data analysis from single genes to gene sets. The GeneGO software package identifies gene sets enriched for metabolic and/or signalling networks, using a hypergeometric model to calculate the null model probability for a set of genes. Enrichment is identified as an extremely unlikely probability under the null model. The results obtained by the GeneGO analysis indicate that the genes exhibiting higher dN/dS values were associated with specific metabolic pathways and biological processes (Figure 6).
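To make the hypergeometric null model concrete, the upper-tail probability can be computed directly from binomial coefficients. This is a generic sketch of the test, not the GeneGO implementation; the function and parameter names are chosen for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of observing at least `hits` annotated genes by chance.

    population: number of genes in the background set
    annotated:  background genes carrying the pathway annotation
    sample:     size of the gene list being tested
    hits:       genes in the list that carry the annotation
    """
    hits = max(hits, 0)
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probability mass from `hits` up to the maximum.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

A very small returned value indicates that the overlap between the gene list and the pathway is unlikely under random sampling, which is what is meant by enrichment.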
The heat map shows that for most genes, the dN/dS values are similar across different species. In order to see if any selection bias exists for different metabolic pathway annotations, the 711 genes were divided into 3 groups according to dN/dS value from dog/cat group. The first group contains the most conserved 178 genes with dN/dS values less than 0.0149, the second group contains the most divergent 178 genes with dN/dS values greater than 0.1229. The third group contains the remainder of genes having dN/dS values between 0.0149 and 0.1229 (see Table 5).
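The partition into conserved, intermediate and divergent groups can be sketched as a rank-based split. The function name and input layout are assumptions; with 711 genes and a fraction of 0.25, rounding gives 178 genes in each tail and 355 in the middle, matching the group sizes reported above.

```python
def partition_by_dnds(dnds_by_gene, fraction=0.25):
    """Split genes into (conserved, middle, divergent) by dN/dS rank.

    dnds_by_gene: dict mapping gene id -> dN/dS value (assumed layout).
    fraction: share of genes placed in each tail.
    """
    # Sort by value, then by gene id so ties are ordered deterministically.
    ranked = sorted(dnds_by_gene, key=lambda gene: (dnds_by_gene[gene], gene))
    k = round(len(ranked) * fraction)
    conserved = ranked[:k]
    middle = ranked[k:len(ranked) - k]
    divergent = ranked[len(ranked) - k:]
    return conserved, middle, divergent
```

The dN/dS values at the two cut points then play the role of the thresholds quoted in the text.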
We examined the metabolic networks of these genes in GeneGO. We observed that the group with lower dN/dS values exhibited fewer amino acid type metabolic networks than the group with larger dN/dS values. Our examination of these metabolic network annotations across the groups of genes provides insight into an interesting pattern that was not apparent from the gene level gene ontology analysis described in the preceding section.
We discovered that the group of genes with smaller dN/dS values are in metabolic networks exhibiting enrichment for carbohydrate metabolism, while the group with larger dN/dS values is associated with more metabolic networks involved in amino acid metabolism (see Table 6). Such patterns of more negative selection acting on carbohydrate metabolism and relatively less negative selection acting on amino acid metabolism may underlie an adaptive evolutionary role for genes associated with amino acid metabolism between obligate carnivores and omnivores. This result is in agreement with known differences in amino acid nutritional requirements between different species. This suggests that, depending on dietary sources and metabolic requirements, the evolution rate may not be the same across all metabolic networks. These results provide an initial analysis of these genes and might be interpreted to suggest that genes associated with amino acid metabolism and biochemical utilization might have undergone different evolutionary selection among obligate carnivores compared to omnivores and herbivores. Such a hypothesis requires further exploration and may ultimately provide a genomic rationale for feline-specific nutritional needs that are distinct from those of other species, including dog.

Figure 6 Heat map of dN/dS values for Cat compared to Dog, Mouse and Human. A set of 711 cDNA sequences with orthologs in dog, mouse and human were sorted by dN/dS (ω) value to generate three groups corresponding to the top 25%, bottom 25% and middle 50%. Each list was used to query the GeneGO database for metabolic pathways. A non-random pattern was observed with genes with higher dN/dS (ω) more frequently associated with metabolic pathways. Red indicates higher dN/dS (ω) and blue corresponds to lower dN/dS (ω) value.

Identification of Metabolic and Biochemical Pathways
Based on the GeneGO findings, we wanted to gain further insight into the biochemical role of the feline cDNA sequences. We chose to further explore how our cDNA sequences map onto metabolic pathways by identifying a set of pathways for which at least one pathway member has been identified in the set of our orthologous cat cDNA sequences. This analysis identified ten distinct classes of biochemical pathways for which 112 feline cDNA sequences have been mapped to 75 different pathways.
The categories of pathways include amino acid metabolism, biosynthesis of secondary metabolites, carbohydrate metabolism, energy metabolism, lipid metabolism, nucleotide metabolism as well as glycan biosynthesis and metabolism, metabolism of cofactors and vitamins and xenobiotic biodegradation and metabolism.
We identified 29 cDNA sequences in pathways underlying common amino acid metabolism pathways and 9 cDNA sequences involved in other amino acid metabolic pathways. We found 29 cDNA sequences that are involved in the metabolism of carbohydrates, 19 cDNA sequences involved in energy metabolism, 7 cDNA sequences associated with glycan biosynthesis and metabolism and 33 cDNA sequences that are involved in lipid metabolism. Additionally, we have identified 18 sequences that participate in the metabolism of cofactors and vitamins, 16 cDNA sequences that are involved in nucleotide metabolism and 12 that are involved in xenobiotic biodegradation and metabolism. Table 7 provides a summary of gene counts for these pathways.

The ten most highly conserved and the ten most highly divergent metabolic pathways are listed along with the p-value for each pathway. The most conserved pathways are associated with the lowest dN/dS values, whereas the most divergent pathways are associated with the highest dN/dS values.

Comparative Phenotype Analysis
Phenotype annotation can provide additional information regarding the physiological function of a gene. Although our dataset includes 1227 cDNA sequences, one of our goals was to identify a relatively small subset of feline genes that represent important clinical, developmental and nutritional aspects of feline biology. This comparative phenotype analysis resulted in the identification of a pleiotropic set of genes that were partitioned into seven phenotype modules, each of which contains a relatively small number of genes that contribute to a comparatively large set of feline relevant phenotypes. The term phenotype module was adapted from the notion of a gene expression module, in which the set of genes exhibit similar patterns of spatial or temporal expression. Each phenotype module was constructed by grouping genes exhibiting related phenotypes based upon the phenotype classes described in the mammalian phenotype browser [40]. Similar phenotypes were grouped by body system and/or common biological processes to create the final set of phenotype modules. These seven modules provide a body system distributed view of the phenotypic roles of some of the genes that encode our 1227 cDNA sequences. The modules, genes and associated phenotypes are included in Table 8.
The cardiac module consists of eight genes and is associated with the following eight phenotypes: cardiac hypertrophy, dilated dorsal aorta, abnormal mitral valve morphology, abnormal cardiac output, abnormal myocardial fiber physiology, enlarged heart, abnormal outflow tract and abnormal coronary artery morphology. This module contains genes that are of relevance to feline cardiac disease such as hypertrophic cardiomyopathy and developmental defects of the heart.
The developmental-patterning module consists of seven genes and is associated with phenotypes that include abnormal mesoderm development, abnormal proximal/distal developmental patterning and abnormal rostral/caudal developmental patterning. Within this module we identified genes associated with distinct cell differentiation and specification properties such as embryonic growth arrest, abnormal trophoblast layer morphology and abnormal white adipose tissue. Additional phenotypes within this module were associated with retinal formation, renal function, intestine morphology as well as cholesterol, triglyceride and corticosterone levels. The phenotypes within this module may be useful in dissecting the genetic mechanisms underlying inherited developmental abnormalities in both domestic and endangered felids.
The third module is an immune and hematopoietic module that contains nine genes and represents phenotypes associated with specific cell types and lineages including macrophage physiology, spleen germinal cell number, granulocyte number, platelet number, T-cell and B-cell proliferation and hematopoiesis. Furthermore, this module exhibited phenotypes associated with susceptibility and resistance to pathogens such as abnormal immune system biology, abnormal class switch recombination and altered rate of infection. Some of the phenotypes in this module, including abnormal somatic hypermutation frequency and lymphoid hyperplasia, were related to cancer, perhaps representing the immune surveillance component of the control of tumorigenesis within the body. Finally, some of the phenotypes within this module were associated with the modulation of specific immunologically important molecules such as cytokine secretion, interferon secretion, IgE levels, IgG1 levels and IgM levels. Genes within this module may offer some insight into feline specific immunological and inflammatory disorders.
The fourth module, energy/nutrition and homeostasis, consists of six genes and exhibits a number of phenotypes associated with energy production and regulation within cells. Some of these phenotypes include decreased circulating glucose level, decreased oxygen consumption, abnormal gluconeogenesis and increased glucagon. Other phenotypes reflect endocrine level regulation of the organism, such as abnormal body weight and decreased body temperature. Additionally, there were phenotypes associated with diseases of energy metabolism such as diabetes; these phenotypes included insulin resistance, abnormal glucose homeostasis and increased circulating insulin level. These phenotypes provide a context for better understanding the unique nutritional and energy requirements of the cat.
The fifth module has five genes and encodes a tumorigenesis module associated with the following phenotypes: B-cell derived lymphoma, increased sensitivity to oxidative stress, increased apoptosis, decreased cellular sensitivity to gamma irradiation, increased incidence of ionizing radiation induced tumors, increased tumor incidence, malignant tumors, adenocarcinoma. The genes in this module may provide a useful gene set for investigating the genetic basis of feline lymphoma and carcinoma.
Module six is a sensory systems module, containing five genes, and is associated with the following visual phenotypes: abnormal lens fiber morphology, cataracts, abnormal optic nerve morphology, abnormal eye electrophysiology, abnormal vision, blindness and optic nerve atrophy. Cats exhibit vision related abnormalities under certain nutritional deficiencies; the genes associated with these phenotypes may provide a better understanding of the observed link between feline nutrition and visual function. Other phenotypes within this module include both hyperekplexia and decreased startle response, which may underlie feline adaptations required for successful predation.
The seventh module is a behavioral/neurological and nervous system set that contains 11 genes. The behavioral phenotypes arising from this module span traits as diverse as motor coordination and balance through learning, memory and gait. Additional phenotypes in this module are associated with emotion and affect as well as vocalization and maternal behavior. Within this module, we identified a number of phenotypes underlying neuronal specific physiological mechanisms such as altered synaptic transmission, altered long term potentiation, abnormal excitatory post synaptic potentials and decreased neurotransmitter release. This module contains a variety of developmentally important nervous system phenotypes having anatomical or histological annotations. These include abnormal brain commissure morphology, abnormal brain development, abnormal embryonic neuroepithelium layer differentiation as well as open neural tube, abnormal cerebellar granule layer, abnormal Purkinje cell layer, small cerebellum, abnormal brain ventricle morphology, abnormal cerebral cortex morphology and abnormal forebrain and hindbrain morphology. Finally, we identified specific CNS phenotypes of clinical importance such as abnormal neuron morphology, abnormal neuron physiology, astrocytosis, brain stem haemorrhage, gliosis and intracranial haemorrhage.
We chose to focus on a relatively small number of gene-phenotype relationships in order to obtain a relatively high resolution picture of important feline phenotypes that may be representative of our cDNA sequences. Our goal was to determine if any of our cDNA sequences were associated with phenotypes that may be of value in understanding the genetic basis of feline specific biology. Our analysis demonstrates that some of our cDNA sequences are indeed associated, through comparative sequence analysis using the mammalian phenotype browser database, with phenotypes that are important in feline health and disease. These modules and related genes provide a useful candidate gene set for domestic cat functional genomics.

Orthologous OMIM Diseases
We identified 104 feline cDNA sequences that are orthologs of human genes for which an OMIM (Online Mendelian Inheritance In Man, http://www.ncbi.nlm.nih.gov/omim) disease has been associated (see Table 9 and Additional file 3, Table S3). Within this data set we observe genes implicated in both dilated and familial cardiomyopathy as well as genes associated with oxidative phosphorylation deficiencies and biochemical disorders of amino acid metabolism. The OMIM associated diseases paralleled the phenotype associations we detected and provided additional insight into the clinical and nutritional role of the cDNA sequences we identified.
Within the set of OMIM diseases, we identified biochemical and metabolic diseases such as disorders of oxidative phosphorylation and glycosylation as well as D-2-hydroxyglutaric aciduria, glycogen storage disease, phenylketonuria due to dihydropteridine reductase deficiency and phosphoglycerate kinase 1 deficiency. Among the disease annotations associated with cancers, we found that our cDNA sequences were associated with specific types of OMIM annotations including breast cancer, colon cancer, esophageal carcinoma, lung cancer, pancreatic cancer and ovarian cancer, to name a few. We also identified diseases of the sensory systems, such as cataracts and deafness. Finally, we discovered a variety of orthologs of human genes implicated in specific disorders, including Leigh syndrome, Hyper-IgD syndrome, immunodeficiency associated with hyper IgM, Griscelli syndrome, STAR syndrome, Charcot-Marie-Tooth disease, retinitis pigmentosa and generalized epilepsy. Together, these OMIM annotations provide a diverse picture of the genes across diseases and offer a unique context for understanding the role of these cDNA sequences in feline health and disease.

Discussion
We identified 1227 feline cDNA sequences derived from tissues obtained from ten cats and performed extensive comparative and functional genomic analyses to elucidate the inferred expression patterns, biochemical functions and phenotypes associated with these sequences. Our cDNA sequences and associated comparative and functional analysis provide an initial perspective on feline biology as viewed through our set of 1227 cDNA sequences. Although it is predicted that the number of protein coding genes encoded in the cat genome is on the order of 20,000 to 25,000, similar to most other mammalian genomes, the number of known published cat protein coding gene sequences is much lower at 2099 sequences (NCBI database, 2011). These 1227 cDNA/gene sequences represent a rich set of potential targets for genetic association studies, biologically relevant diets and pharmacologically active compounds which can be developed to enhance the well-being of companion cats worldwide. Additionally, these sequences have value in similar applications for endangered felids.
Our strategy to identify a set of 1227 high quality and high confidence cDNA sequences from feline tissue samples expands the expressed sequence data for domestic cat. Although we initially obtained over 3000 cDNA sequences, we chose to filter our sequences so that the set we describe would be of the most value for the feline genomics community. Specifically, the conservative strategy outlined in Figure 1 resulted in a set of 913 known sequences and 314 novel sequences (1227 sequences in total) of which 914 orthologous clusters across feline, human, dog and mouse were identified (for which 844 were known cDNA sequences and 70 were novel cDNA sequences). The genes corresponding to these 914 orthologs were used as input sequences for a variety of bioinformatics and computational analyses aimed at providing an initial perspective on the physiological and pathological roles of these sequences in feline development, nutrition and health. Although we have identified a number of interesting results using computational and sequence comparison methods, our analysis only identifies the potential roles of these genes based on comparative analysis in other species. However, validating these results and proving the function of these genes will require molecular and biochemical experimental analysis. The results of our inferred expression analysis provide a set of gene expression patterns consistent with the source tissues used for cDNA production. Of the 21 source tissues used as starting material, inferred expression patterns from each anatomical region were detected with greater than 100 genes being associated in each case. It is interesting to note that each of these tissues exhibited relatively high gene expression numbers (i.e., numbers of genes associated with anatomical expression), which is what one would expect if the inferred expression patterns were an accurate representation of the true expression patterns of the source tissues. 
Tissues such as brain (725 genes), heart (629 genes), pancreas (568 genes) and testis (703 genes) exhibit inferred expression of more than 60% of the genes encoding our 1227 cDNA sequences. Inferred cellular expression patterns correlated with cell types expected in the source tissues including glial cells and neurons (432 genes and 124 genes respectively), retinal pigment epithelium cells (514 genes), and skeletal muscle cells (499 genes). Together, these results provide an expression framework for understanding the roles of these cDNA sequences in feline physiology and pathology. Because greater than 70% of our cDNA sequences were associated with embryological expression patterns we were not surprised to discover that a significant number of developmental phenotypes were associated with our set of cDNA sequences. Specifically, we identified genes associated with abnormal heart morphology and abnormal cardiac blood flow, abnormal mesoderm development, abnormal developmental patterning and abnormal retinal neuronal layer morphology. These phenotypes are consistent with the expression and role of genes identified in the source tissues selected for cDNA sequencing. The fact that the inferred expression patterns exhibit greater breadth of expression than the starting tissues is in line with the notion that genes tend to be expressed in complex spatial and temporal patterns. It may be the case that the inferred expression patterns include some anatomical, cellular and/or developmental expression patterns which may be false positives, however the overall picture of expression provided by this analysis greatly enhances the value of these cDNA sequences in genomic applications. Interestingly, our analysis of gene ontology in the context of dN/dS values of individual orthologous cDNA sequences provides insight into how the domestic cat is both similar to and differs from other mammals. 
We detected evidence of negative selection acting on genes associated with microtubules and the actin cytoskeleton, suggesting that genes associated with these cellular structures are fairly well conserved among mammals [41,42]. Additionally, we identified gene ontology annotation terms affiliated with the nucleus, the chromosomes and DNA replication exhibiting relatively low values of dN/dS along with orthologs associated with transcriptional regulation and translational elongation. Similar values were obtained for genes annotated as G-protein beta/gamma binding and trans-Golgi network trafficking, vesicle and endoplasmic reticulum compartment membrane and SNARE complex. This is not surprising given that the housekeeping functions of mammalian cells are relatively well conserved. All cells must transmit information from the genome into RNA and protein components in a manner that maintains the appropriate subcellular compartmentalization of molecular functions. Intracellular trafficking that diverges from cellular requirements is likely to have deleterious consequences, leading to negative selection. Microtubules are involved in cellular integrity, cell motility and cell division; all of these processes are critical for cell viability [41,43].
In comparison to these highly conserved orthologs which mediate the core cellular processes, we detect evidence of considerably less negative selection acting on orthologs associated with transmembrane receptors, apoptotic signals, guanyl-nucleotide exchange factors and GPCR activity. Additionally, we identified evidence of less negative selection among orthologs associated with extracellular spaces, mitochondrial membrane affiliation and integral proteins of the plasma membrane. Unlike the highly conserved orthologs with intracellular functions, these orthologs form the basis of interactions across cells, through the extracellular space into the nucleus and organelles by a variety of signal transduction mechanisms for which multiple paralogous genes exist in each species. Such patterns of selection have been identified by others and represent evolutionary patterns of selection that may be associated with positive selection in different evolutionary lineages [44]. Moreover, these cDNA sequences might encode proteins for which extracellular environment plays a selective role during evolution.
It is well documented that paralogs diverge at a greater rate than orthologs [45,46]. Because our analysis did not include the entire set of genes from the cat, we cannot rule out the possibility that some of our orthologs are not true orthologs. It is worthwhile to point out that our analysis included only cat, dog, mouse and human genes which effectively limits the detection of evolutionary selection using the dN/dS ratio because some of these species diverged more than 100 million years ago. Nonetheless, it is interesting that others have observed similar patterns of divergence in protein networks operating at the cellular periphery and within the extracellular space [44,47]. Our analysis identified orthologs associated with respiratory chain and mitochondria as exhibiting relatively lower levels of negative selection. It is possible that the predatory status of cats resulted in adaptive changes in energy production and oxidative phosphorylation that facilitate the high energy requirements of predation.
It is interesting that we detect evidence of divergence within apoptotic genes in the cat compared to other mammalian species. This may underlie species specific differences in adaptation, such as what might be expected to have happened as obligate carnivores diverged from a common ancestor of omnivores and herbivores. The high protein requirements coupled with enhanced predatory fitness may have co-evolved with differences in cellular response to stress and cellular apoptosis, both within and outside of the brain. This hypothesis is supported by the metabolic network analysis in GeneGO where the top 25% dN/dS values were associated with metabolic pathways implicated in non-carbohydrate roles. The metabolic network analysis performed with GeneGO demonstrated that genes in the group with smaller dN/dS values are associated with metabolic networks most involved in carbohydrate metabolism, while the genes in the larger dN/dS value group are in metabolic networks most involved in amino acid metabolism. This suggests that, depending on metabolic requirements, the evolution rate may not be the same across all metabolic networks, and obligate carnivores like cats may exhibit relatively less negative selection acting on genes involved in amino acid metabolism and more neutral selection acting on carbohydrate associated genes. This result is in agreement with the observation that cats exhibit different dietary requirements for the amino acids taurine [20], arginine [21], cysteine and methionine [22]. In contrast to dogs, cats are unable to synthesize taurine from cysteine [34]; consequently, taurine deficiency in cats is associated with a variety of clinically important conditions including cardiac [27], immune [28], neurological [29], platelet [30], reproductive [31] and retinal [32] dysfunctions. Additionally, cats exhibit rapid onset of ammonia toxicity resulting from arginine deficiency and, in severe cases, may die within 24 hours [23,48].
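The partition of orthologs into lower and upper dN/dS groups can be illustrated with a short sketch. It assumes a mapping from gene identifier to a single dN/dS value and uses quartile cut points; the gene names, the values and the choice of the "inclusive" quantile method are illustrative assumptions, and this is not the GeneGO procedure itself.

```python
from statistics import quantiles


def split_by_dnds_quartile(dnds_by_gene):
    """Return (conserved, divergent) gene sets.

    conserved: genes at or below the first quartile of dN/dS
    divergent: genes at or above the third quartile (the top 25%)
    """
    q1, _median, q3 = quantiles(dnds_by_gene.values(), n=4, method="inclusive")
    conserved = {gene for gene, value in dnds_by_gene.items() if value <= q1}
    divergent = {gene for gene, value in dnds_by_gene.items() if value >= q3}
    return conserved, divergent


# Placeholder values (hypothetical, not results from this study).
example = {
    "geneA": 0.05,
    "geneB": 0.10,
    "geneC": 0.20,
    "geneD": 0.40,
    "geneE": 0.80,
}
conserved, divergent = split_by_dnds_quartile(example)
```

Each resulting gene set could then be submitted separately to a pathway or network enrichment tool to compare which metabolic networks dominate each group.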
Through the use of KEGG pathway annotation, we identified domestic cat genes involved in a variety of amino acid related pathways including the metabolism of alanine, aspartate, arginine, proline, glutamate, glycine, serine, threonine, histidine, lysine, methionine, phenylalanine, tyrosine and tryptophan. We identified specific pathways in amino acid metabolism, which tend to differ between obligate carnivores and omnivorous mammals [49]. These include six genes involved in tryptophan metabolism which are of value for cats because they are unable to synthesize niacin from tryptophan, as compared to omnivores [48]. Additionally we identified three genes involved in arginine metabolism, which is an essential amino acid in cats [26]. We identified genes involved in glutamate metabolism, which may provide insight into the metabolic consequences of the low levels of ornithine produced from glutamate in cats [48].
We also identified genes associated with pathways underlying lipid metabolism, including genes participating in the biochemical pathways of linoleic acid, alpha-linolenic acid and arachidonic acid, which is noteworthy because cats cannot use linoleic acid for the biosynthesis of arachidonic acid [48]. Further analysis of these genes may provide clues about feline biochemistry associated with arachidonic acid, which may be important in feline reproduction [36]. Finally, we identified genes involved in the metabolism of retinol, which represent another important gene set because cats are unable to synthesize retinol from beta-carotene [50].
The metabolism and biosynthesis of cofactors, vitamins and glycans is important in the nutrition and health of animals. Within these biochemical pathways, we identified three genes associated with folate metabolism, seven genes involved in glutathione metabolism, two genes associated with keratan sulfate biosynthesis, two genes associated with N-glycan biosynthesis and three genes associated with pantothenate and CoA biosynthesis. Some of these genes may provide value as important biological markers for monitoring oxidative stress, apoptosis and immune function in cats [51].
Collectively, many of these genes and their associated pathways are important for feline health and nutrition because they represent biochemical processes that cats have adapted to accommodate the narrow dietary range of an obligate carnivore in contrast to omnivorous mammals. The subsequent characterization of these genes and pathways may provide a genomic foundation for understanding how obligate carnivores differ from other animals in both health and disease.
Our functional and evolutionary analysis suggests that through divergent evolutionary trajectories, different species evolve slightly different biochemical processes of cells, tissues and organs that contribute to the manifestation of species specific adaptations and disorders. The domestic cat is known to suffer from a number of hereditary diseases, many of which have counterparts in other species like humans and dogs [52]. As part of our investigation into the biological significance of our cDNA sequences, we employed a comparative genomics approach to discover the phenotypes associated with these sequences. Our approach leveraged the mammalian phenotype ontology that has been developed as part of the mouse genome database [40]. We decided to select a relatively small number of genes for which a considerable number of important phenotypes may be associated.
Our phenotype data were obtained from previously published mouse phenotyping studies using transgenic or knockout mice. Consequently, they should be considered as related to, rather than identical to, the true phenotypes that might arise in the cat. Because our method relies upon orthologous relationships between cat and mouse genes, it is worthwhile to point out that inaccurate mappings between orthologs may lead to inaccurate predictions of phenotypes. Furthermore, as we have described throughout this paper, the cat shares many general biological processes with other mammals. The cat also has well documented differences when compared to omnivorous animals. Therefore, one must consider the phenotype analysis as a general thematic picture of the functional consequences of our cDNA sequences rather than as a one-to-one mapping of gene-phenotype associations within our cDNA sequences.
We identified seven phenotypic modules exhibiting 136 phenotypes arising from only 38 genes. Many of the genes we identified exhibit numerous phenotypes, both within and across modules. Such pleiotropic effects underlie the complexity of mammalian genomes and provide context for future genomic studies. We selected these gene-phenotype associations to provide a detailed, but yet tractable picture of how our cDNA sequences might map onto anatomical and physiological traits.
Within the cardiac module, we identified eight genes associated with phenotypes relating to cardiac disease in cats. Some of the genes within this module include tropomodulin 1, snail homolog 1 and an interleukin receptor antagonist. This module includes phenotypes of cardiac hypertrophy and mitral valve defects, both of which are known hereditary diseases in cats [53]. These genes provide examples of the types of phenotypes that might arise from perturbations of cat genes underlying inherited feline cardiac diseases, such as aortic stenosis, atrial-septal defect, mitral valve dysplasia, tetralogy of Fallot and ventricular-septal defect [53,54].
Our developmental module consists of seven genes and includes a TGFbeta induced homeobox transcription factor as well as the signaling molecule arginine vasopressin. The phenotypes associated with this module include developmental patterning across both the proximal/distal axis and the rostral/caudal axis. The phenotypes also include cellular specification and patterning such as mesoderm development, trophoblast layer morphology and adipose tissue differentiation, to name a few. Domestic cats exhibit a variety of developmental defects, such as polydactyly, hip dysplasia, sacrococcygeal dysgenesis, portocaval shunt, open central fontanel, open lateral fontanel and thoracic hemivertebra [54][55][56][57]. The cDNA sequences we describe may include genes that are responsible for abnormal developmental conditions in domestic and endangered felids.
We identified a sensory module, which contains five genes, including NADH dehydrogenase (ubiquinone) Fe-S protein 4 and caspase 9, apoptosis-related cysteine peptidase. This module includes the phenotypes of cataracts, blindness and optic nerve atrophy. Examples of inherited sensory system disorders in the domestic cat include cataracts, corneal dystrophy (stromal and endothelial), progressive retinal atrophy and glaucoma [58]. The overlap between retinal and ocular phenotypes and inherited feline diseases suggests that there are specific genomic regions, represented by our cDNA sequences, which may include aspects of the genetic mechanisms of these debilitating diseases in cats. It is interesting to note that our sensory module includes genes involved in energy production. This is not surprising as retinal tissue is known to exhibit relatively high energy requirements and depletion of energy in this tissue has been associated with blindness and other vision defects [50].
Within our energy and homeostasis module, we identified genes like glycerol kinase 2, NAD(P)H dehydrogenase quinone 1 and NADH dehydrogenase (ubiquinone) Fe-S protein 4. The phenotypes within this module are associated with traits of clinical and adaptive importance in the cat. For example, our comparative phenotype analysis identified phenotypes of insulin resistance, increased circulating insulin level and impaired glucose tolerance; traits associated with the feline hereditary disease of diabetes mellitus [59]. This module also contains phenotypes such as abnormal gluconeogenesis, increased glucagon, abnormal glucose homeostasis and increased circulating ammonia level, which are important in felid nutrition as cats use gluconeogenesis as a predominant form of energy production and are susceptible to ammonia toxicity [17,18]. The genes in this module are of value in exploring some of the fundamental metabolic and biochemical differences between obligate carnivores and omnivores. Moreover, these genes may provide a genomic basis for specific diets that can reduce the incidence of feline disorders associated with specific nutritional deficiencies.
Within other modules, we identified phenotypes associated with cancer, such as increased tumor incidence, malignant tumors and B-cell derived lymphoma, which may provide clues to the genetic susceptibility cats have for hereditary lymphoma [60]. Among the behavioral phenotypes within the nervous system module, we identified a number of traits that may represent predator specific adaptations of cats. For example, we identified cDNA sequences associated with spatial learning, balance, righting response, gait and motor coordination; traits that are almost synonymous with cats and of extreme adaptive value for an apex hyper predator.
The comparative genomics analysis of OMIM diseases within our cDNA sequence data set provides a final perspective on the importance of our reported sequences in the health of domestic cats. Many of the diseases identified in the OMIM mapping are also represented by phenotypes within the modules. This independent annotation demonstrates that our analysis converges even though OMIM analysis leverages human orthology relationships and the phenotype analysis leverages murine orthology relationships. It is worth noting the limitation of sequence based comparative genomics approaches. They can provide considerable insight into the functional role of our cDNA sequences, but must ultimately be proven through focused and carefully designed genomics studies in cats. Nonetheless, our cDNA sequences and associated analysis provide considerable value through the identification of many interesting clinically and nutritionally relevant feline genes.
The set of diseases and phenotypes provides a starting point for candidate gene approaches and for the selection of biomarkers for monitoring nutrition and health. By combining diverse types of annotation, we can better understand the function of a given gene across a breadth of tissues and organ systems, the biological processes it is involved in at the organismal level, and its role in disease. For example, we identified genes associated with expression in the heart, and with a number of cardiac phenotypes, including cardiac hypertrophy, abnormal outflow tract and abnormal mitral valve morphology, as well as the OMIM disease annotation of dilated cardiomyopathy. These are of direct relevance to feline disease, since hypertrophic cardiomyopathy is a common clinical concern in cats [53].
The recent development of a 70,000 SNP feline bead array by Hill's Pet Nutrition and the Morris Animal Foundation provides an important and powerful resource for conducting gene association studies in the domestic cat, and related endangered species. However, even in the absence of whole-genome genetic association approaches, our characterization of these 1227 cDNA sequences provides an extremely valuable resource for candidate gene approaches aimed at investigating the genetic basis of feline phenotypes. It will be interesting to see how our comparative and functional analysis of these 1227 cDNA sequences compares to the data produced from high throughput sequencing and future genetic studies within and across different breeds in the domestic cat. It is likely that some of our functional annotations may turn out not to hold, and it is equally likely that some of them will. Through collaborative efforts, it will be possible to begin unravelling the genetic mechanisms underlying feline health and disease.

Conclusions
We report the identification of 1227 feline cDNA sequences, of which 913 correspond to higher quality versions of public feline sequences and 314 correspond to novel feline sequences for which no public sequence data exist. Our comprehensive functional analysis identified a number of physiologically important biochemical pathways in which these sequences are involved, as well as the developmental, clinical and nutritional phenotypes with which they are associated.

Construction of feline tissue specific cDNA libraries
The study protocol was reviewed and approved by the Institutional Animal Care and Use Committee. All cats were immunized against feline panleukopenia, calicivirus, rhinotracheitis and rabies. Cats were housed with 10-12 other cats and food was continuously available throughout the day until their daily caloric requirements were consumed. Cats were housed in spacious rooms with natural light that varied with seasonal changes. Cats experienced behavioral enrichment through interactions with each other, daily interaction and play time with caretakers, large windows and sun porches from which to watch the natural landscape, and access to toys. At the end of their natural life, cats were euthanized for humane purposes and tissues were stored at -80 °C.
Total RNA was purified from 21 feline tissues (brain, kidney medulla/cortex, spleen, heart, liver, lung, skeletal muscle, thyroid gland, lymph node, pancreas, adrenal gland, tongue, colon, mammary gland, neonatal thymus, brain and testes) collected from 10 domestic shorthaired cats postmortem, three cell lines derived from kidney, brain, lung, and 1 tissue pool using standard procedures as described in [48]. The purity and integrity of each RNA sample was assessed by spectrophotometry and gel electrophoresis. Forty normalized cDNA libraries were constructed by Agencourt Inc. (Beckman-Coulter Genomics), 22 with standard inserts (1.2 kb) and 18 with long inserts (> 4 kb). The first and second cDNA strands were synthesized using optimized methods, and cDNAs were size selected prior to cloning. The size-selected cDNAs were directionally cloned into the pAGEN vector by polishing and restriction digest, creating a 5' blunt end and a 3' overhang.
Each cDNA library was subsequently tested for specific quality control measures (average insert size, number of independent clones and percentage of recombinant clones), and normalized to reduce the proportion of highly abundant mRNAs. Normalization was performed by dividing each library into two populations, using the first for in vitro transcription of biotinylated RNA, and the second to generate single stranded phagemid DNA. The two populations were then mixed, and self-hybridized DNA-RNA molecules corresponding to overrepresented mRNAs were removed. The remaining single stranded DNA molecules were primed for second strand synthesis and the resulting clones were transformed into bacteria, yielding the normalized libraries.

Sequencing of feline cDNA libraries
Plasmids were purified from each library using a large-scale automated protocol, the SprintPrep® Solid Phase Reversible Immobilization procedure. Sequencing reactions were performed in 384-well plates using BigDye® Version 3.1 direct cycle sequencing (Applied Biosystems, CA). Sequencing reactions were purified using the CleanSeq® dye-terminator removal kit (Agencourt, Inc.), and resolved by capillary electrophoresis using the ABI3730 Genetic Analyzer (Applied Biosystems, CA). Sequencing reads were processed using Phred and quality scores for each run were monitored using the Agencourt, Inc. Galaxy LIMS system. Sequencing of these cDNA libraries yielded a total of 919,676 EST reads.
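For readers unfamiliar with Phred quality scores, the short sketch below shows the standard relationship between a score Q and the estimated base-calling error probability P, namely Q = -10 log10(P). The helper names, the example quality values and the Q20 cut-off are illustrative assumptions and are not taken from the monitoring procedure used in this study.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score into an estimated error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred quality score."""
    return -10 * math.log10(p)


def fraction_at_or_above(qualities, threshold=20):
    """Fraction of bases whose quality meets a threshold (Q20 is an assumed cut-off)."""
    if not qualities:
        return 0.0
    return sum(q >= threshold for q in qualities) / len(qualities)


# Made-up per-base quality values for a single hypothetical read.
read_qualities = [35, 40, 12, 28, 19, 33]
print(phred_to_error_probability(20))
print(fraction_at_or_above(read_qualities))
```

A score of 20 therefore corresponds to an estimated one-in-a-hundred chance that the base call is wrong, and a score of 30 to one in a thousand.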

Data Management and Analysis
The sequence data, annotation data and the data resulting from sequence analysis were loaded into the MySQL relational database version 5 to facilitate data management and analysis [61].

Sequence Filtering and Ortholog Detection
A set of 3035 full length feline cDNA sequences was obtained from the analysis of the sequencing data and used to identify a set of high confidence cDNA sequences. All cDNA sequences were translated in 6 reading frames and the longest protein coding sequence obtained was noted. These cDNA and protein sequences were clustered using blast to identify a set of non-redundant nucleotide and non-redundant protein sequences, using a stringency of 95% or greater as the criterion for identifying redundant sequences. For each cluster, the longest sequence was chosen as the non-redundant representative. The intersection of the non-redundant nucleotide sequences and the non-redundant protein sequences was used as the set of non-redundant sequences.
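The six-frame translation step can be illustrated with a minimal sketch. It assumes the standard genetic code and treats the "longest protein coding sequence" as the longest stretch that begins at a methionine and runs to the next stop codon (or to the end of the frame); the original analysis may have used a different definition or different software, and the function names here are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(cdna: str) -> str:
    """Longest Met-initiated peptide over all six reading frames (assumed definition)."""
    cdna = cdna.upper()
    best = ""
    for strand in (cdna, reverse_complement(cdna)):
        for frame in range(3):
            # Stop codons translate to '*', so splitting yields stop-free segments.
            for segment in translate(strand[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


print(longest_orf("CCATGGCTAAATGACC"))  # toy sequence, not a feline cDNA
```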
The BLAST programs, blastp and blastn [62,63], were run with the non-redundant full length feline sequences as query and the target species sequences downloaded from ENSEMBL ftp://ftp.ensembl.org/pub/current/fasta/ [36]. Because the human sequence sets contain the greatest number of target sequences (147,141 nucleotide sequences and 81,968 protein sequences), the set of non-redundant sequences was mapped to the human sequences. Additionally, the full length sequences were mapped to the set of known feline cDNA and protein sequences in order to classify the full length non-redundant feline sequences as either known or novel. Known indicates that the sequence is represented by a feline sequence in the public ensembl transcript/protein sequence data, while novel indicates that the sequence does not have a representative transcript or protein sequence in the ensembl data set.
Because the public feline data do not contain all of the protein coding genes, it was not possible to perform an ortholog search using the standard reciprocal best hit approach. Instead, the blast results were filtered using an iterative heuristic process that selected blast hits by match length, number of gaps, number of mismatches and percent identity. In total, eight iterative steps were performed, beginning with the most stringent and ending with the least stringent. Each step identified a set of qualifying non-redundant full length sequences, and only sequences that had not been selected by an earlier step were added to the set of results. The first and most stringent step required that the blast match length equal the length of the shorter of the two sequences (query or subject), with number of mismatches = 0, number of gaps = 0 and percent identity ≥ 99%. The second step used a blast match length ratio ≥ 0.99, number of mismatches = 0, number of gaps = 0 and percent identity ≥ 99%. The third step identified additional sequences that satisfied the third step criteria and that the first two steps did not select. The third step criteria were blast match length ratio ≥ 0.87, number of mismatches ≤ 4, number of gaps = 0 and percent identity ≥ 99%. The iterative process continued for a total of eight steps, with each subsequent step relaxing the filtering criteria in order to identify sequences that were not identified in the previous steps. Fourth step criteria were blast match length ratio ≥ 0.725, number of mismatches ≤ 5, number of gaps = 0 and percent identity ≥ 99%. Fifth step criteria were blast match length ratio ≥ 0.69, number of mismatches ≤ 4, number of gaps ≤ 1 and percent identity ≥ 99%. Sixth step criteria were blast match length ratio ≥ 0.625, number of mismatches ≤ 8, number of gaps ≤ 1 and percent identity ≥ 98%. Seventh step criteria were blast match length ratio ≥ 0.575, number of mismatches ≤ 13, number of gaps ≤ 2 and percent identity ≥ 97%. The eighth step criteria were blast match length ratio ≥ 0.52, number of mismatches ≤ 12, number of gaps ≤ 2 and percent identity ≥ 97%.
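The eight-step cascade can be summarized as a small sketch in which each blast hit is assigned to the first (most stringent) step whose thresholds it satisfies. The thresholds are taken from the text above. The `Hit` record, the function names, and in particular the definition of the match length ratio as match length divided by the length of the shorter sequence are assumptions made for illustration; the text does not define the ratio explicitly.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """One blast hit; the field names are hypothetical."""
    query_id: str
    subject_id: str
    match_length: int
    query_length: int
    subject_length: int
    mismatches: int
    gaps: int
    percent_identity: float


# (min length ratio, max mismatches, max gaps, min percent identity), steps 1 to 8.
TIERS = [
    (1.0, 0, 0, 99.0),
    (0.99, 0, 0, 99.0),
    (0.87, 4, 0, 99.0),
    (0.725, 5, 0, 99.0),
    (0.69, 4, 1, 99.0),
    (0.625, 8, 1, 98.0),
    (0.575, 13, 2, 97.0),
    (0.52, 12, 2, 97.0),
]


def length_ratio(hit: Hit) -> float:
    """Assumed definition: match length relative to the shorter sequence."""
    return hit.match_length / min(hit.query_length, hit.subject_length)


def first_passing_tier(hit: Hit) -> Optional[int]:
    """Return the 1-based index of the most stringent step the hit passes."""
    ratio = length_ratio(hit)
    for step, (min_ratio, max_mm, max_gaps, min_ident) in enumerate(TIERS, start=1):
        if (ratio >= min_ratio and hit.mismatches <= max_mm
                and hit.gaps <= max_gaps and hit.percent_identity >= min_ident):
            return step
    return None


def assign_tiers(hits: Iterable[Hit]) -> Dict[str, Tuple[int, Hit]]:
    """For each query, keep the hit accepted at the earliest step."""
    best: Dict[str, Tuple[int, Hit]] = {}
    for hit in hits:
        step = first_passing_tier(hit)
        if step is not None and (hit.query_id not in best or step < best[hit.query_id][0]):
            best[hit.query_id] = (step, hit)
    return best
```

Testing the steps in order and stopping at the first match mirrors the rule that a later, more relaxed step only contributes sequences that no earlier step selected.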
The resulting set of non-redundant full length sequences was considered to represent the high quality feline cDNA and protein sequences. The high quality sequences that mapped to a known public feline sequence were used to generate global nucleotide and protein alignments using the partial order alignment software POA http://bioinfo.mbi.ucla.edu/poa2/POA_Online/Align.html [64,65]. All alignments were manually inspected to ensure that each non-redundant full length feline sequence mapped to the correct public feline sequence.

Comparative Expression Analysis
In order to infer anatomical and cellular expression patterns of our sequences, four expression annotation files were downloaded from the public biomart http://www.biomart.org [66] web server. Because we mapped our sequences to their corresponding human orthologs, we downloaded the human biomart egenetics annotation data sets mapped to the human gene identifiers of ensembl gene version 60. The four annotation sets we obtained consisted of human ensembl gene identifiers mapped to (1) a set of anatomical terms, (2) a set of cell types, (3) a list of pathological terms and (4) a list of developmental stages ranging from weeks to years.
Although our sequences represent a subset of gene products, we found value in identifying the spectrum of expression patterns these sequences may exhibit beyond the tissue libraries that we used. The mapping was accomplished by loading each of the four gene expression annotation files into the MySQL relational database and performing SQL queries that joined these expression tables to our orthologous gene set using the ensembl human gene identifier.
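The join described above can be sketched as follows. The example uses Python's built-in `sqlite3` module purely as a stand-in for the MySQL database so that it runs without a server; the table names, column names and identifiers are hypothetical placeholders and do not reflect the actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Feline sequences and the human ortholog each was mapped to (hypothetical schema).
    CREATE TABLE feline_ortholog (sequence_id TEXT, human_gene_id TEXT);
    -- One of the four annotation files: human gene id to anatomical term.
    CREATE TABLE anatomy_expression (human_gene_id TEXT, anatomical_term TEXT);
""")
conn.executemany(
    "INSERT INTO feline_ortholog VALUES (?, ?)",
    [("FCA_0001", "HUMAN_GENE_A"), ("FCA_0002", "HUMAN_GENE_B")],
)
conn.executemany(
    "INSERT INTO anatomy_expression VALUES (?, ?)",
    [("HUMAN_GENE_A", "kidney"), ("HUMAN_GENE_A", "liver"), ("HUMAN_GENE_C", "brain")],
)

# Each feline sequence inherits every annotation of its human ortholog.
rows = conn.execute("""
    SELECT f.sequence_id, a.anatomical_term
    FROM feline_ortholog AS f
    JOIN anatomy_expression AS a ON a.human_gene_id = f.human_gene_id
    ORDER BY f.sequence_id, a.anatomical_term
""").fetchall()
print(rows)
```

An inner join drops feline sequences whose human ortholog has no annotation in the file; a left join would retain them with empty annotation fields if a complete listing were needed.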

dN/dS Codon Substitution Rate Calculations
In order to better understand the evolutionary relationships between the feline cDNA sequences and the orthologous sequences in dog, human and mouse, we calculated dN/dS values for orthologous sequences across the different species. The Phylogenetic Analysis by Maximum Likelihood software (PAML version 4.4) was used to compute codon statistics with the "codeml" program. Codon statistics were computed wherever possible (that is, where both DNA and protein sequences were available for the orthologs) with the basic model (NSsites = 0), in which ω = dN/dS is the ratio of nonsynonymous to synonymous substitution rates. The ω ratio is a measure of natural selection acting on the protein. Simplistically, values of ω < 1, = 1 and > 1 indicate negative (purifying) selection, neutral evolution and positive selection, respectively. PAL2NAL [67,68,69] was used to create codon alignments from the cDNAs and the proteins as input to the PAML program, which computes dN, dS and the ω ratio. Codon substitution rate data were loaded into the MySQL relational database and used to assess the evolutionary pressure exerted on specific groups of genes. The gene groups were derived from other annotation types, such as gene ontology and phenotype annotation results.
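The idea behind a codon alignment can be shown with a simplified sketch: each residue in a gapped protein alignment row is replaced by the codon that encodes it, and each gap by three gap characters. This is only an illustration of the concept under the assumption that the cDNA is exactly the in-frame coding sequence of the protein; it is not the PAL2NAL implementation, which handles additional cases, and the function name is hypothetical.

```python
def thread_codons(aligned_protein: str, cdna: str) -> str:
    """Map one gapped protein alignment row back onto its coding sequence."""
    if len(cdna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cdna[i:i + 3] for i in range(0, len(cdna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("number of residues and number of codons differ")
    remaining = iter(codons)
    # A gap in the protein row becomes a three-base gap in the codon row.
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)


# Toy example: the protein row 'M-AK' aligned against a longer ortholog.
print(thread_codons("M-AK", "ATGGCTAAA"))
```

Keeping gaps in multiples of three preserves the reading frame, which is what allows synonymous and nonsynonymous differences to be counted codon by codon.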

Gene Ontology Annotation
Gene ontology annotation was added to the set of orthologous sequences via comparative sequence analysis. Because the orthologous sequences were already mapped to the human transcripts and proteins, we decided to download the gene ontology annotation files corresponding to biological process, molecular function and cellular localization in order to annotate the non-redundant full length feline sequences with gene ontology terms [70]. The gene ontology annotation files linking the gene ontology terms to the ensembl human gene identifiers were obtained from biomart http://www.biomart.org. Each feline sequence we identified was annotated with all of the gene ontology terms associated with the orthologous human gene. In this manner, we were able to identify a larger set of gene ontology annotations per feline gene than we could have accomplished if we limited the annotation mapping to only the feline cDNA sequences we identified. Through this greedy algorithm, we were able to gain a more comprehensive understanding of the genes we identified. SQL queries in MySQL database were used to map the human gene ontology annotation terms to the orthologous feline genes encoding the cDNA sequences we identified.

GeneGO Metabolic Network Analysis
Metabolic network analysis of the feline sequences (uploaded using ENSEMBL identifiers) was performed using the MetaCore software (GeneGO, St. Joseph, MI). MetaCore identifies networks based on a manually curated database containing known molecular interactions, functions, and disease interrelationships. Networks are ranked by the probability that a random set of genes the same size as the input list would give rise to a particular mapping by chance. In this way, enrichment of biologically relevant pathways or networks can be detected.
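The notion of a chance mapping probability can be illustrated with a hypergeometric tail probability, a common choice for gene set enrichment. The sketch below is a generic illustration under that assumption and is not a description of the proprietary MetaCore calculation; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, overlap: int) -> float:
    """P(X >= overlap) when `sample` genes are drawn at random from `population`
    genes, of which `annotated` belong to the network of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 10 genes in total, 4 in the network, 3 in the input list, 2 shared.
print(enrichment_p_value(population=10, annotated=4, sample=3, overlap=2))
```

A small value indicates that the observed overlap between the input list and the network would rarely arise from a random gene set of the same size.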

KEGG Pathway Annotation
Pathway associations were identified using a comparative genomics approach. Because our orthologous sequences were mapped to human orthologs, it was possible to use the human pathway association information to map the pathways onto the orthologous feline sequences. This was accomplished using SQL queries to join the KEGG http://www.genome.jp/kegg/pathway.html [71] and Biocarta http://www.biocarta.com/genes/index.asp [72] pathway data, which are associated with human ensembl gene identifiers, to the feline gene identifiers. Additional pathways were identified using the David Bioinformatics Database http://david.abcc.ncifcrf.gov/ [73] through a gene set search using the ensembl gene identifiers for the set of human orthologs of the feline sequences we identified.

Comparative Phenotype Mapping
Gene specific phenotype annotation derived from mouse knockouts and/or transgenic strains is compiled and made publicly available at the Mouse Genome Database http://www.informatics.jax.org/ [40]. The phenotype annotation is structured within the mammalian phenotype ontology, which provides an acyclic graph of mammalian morphological and physiological phenotypes. Because the mouse phenotype data are associated with each mouse gene, it was possible to link the mammalian phenotype ontology to the feline non-redundant full length sequences through a two-step process. First, the mammalian phenotype annotations linked to mouse gene identifiers were obtained and loaded into the MySQL database. Next, an SQL query was performed that created a table joining the phenotype information with our feline sequence data. The resulting phenotype annotations on the feline orthologous gene set provide an additional mechanism for understanding the role of these cDNA sequences in cat development, health and disease.

OMIM Disease Mapping
A comparative genomics map of our feline sequences annotated with the OMIM http://www.ncbi.nlm.nih.gov/omim [74] disease information was generated using two different approaches. The first approach utilized MIM disease data that was produced from biomart and anchored to the human ensembl gene identifiers. The resulting annotation file was loaded into the relational database and an appropriate SQL query was used to connect the disease information to the feline sequences through the orthologous relationships that were previously determined. The resulting mapping provided formal associations between feline cDNA sequences and OMIM disease information http://www.ncbi.nlm.nih.gov/omim.
A second method of mapping the feline sequence data to the OMIM data was used to expand the set of OMIM annotated feline cDNA sequences. Specifically, the set of human ensembl gene identifiers corresponding to the orthologs of the feline cDNA sequences was used to query the David Bioinformatics database for OMIM disease information. The resulting file downloaded from the David Database contained human ensembl gene identifiers and OMIM disease identifiers. This file was loaded into the MySQL database and linked with the non-redundant feline cDNA sequences using an appropriate SQL query.

Additional material
Additional file 1: The cDNA and protein sequences and other information corresponding to the 1227 identified feline sequences. This table lists both the cDNA and protein sequences and corresponding lengths for each of the 1227 feline sequences we identified, along with the designation novel (314 sequences) or known (913 sequences). Sequence Identifier (unique identifier for each of the 1227 non-redundant feline sequences). Status (known denotes sequences that map to public feline sequences and novel denotes sequences with no identity to public feline sequences). cDNA Sequence Length (nucleotide length of cDNA sequence). cDNA Sequence (nucleotide sequence of cDNA). Protein Sequence Length (amino acid length of protein sequence). Protein Sequence (protein sequence corresponding to longest translation product of the cDNA).
Additional file 2: Sequences and other information on orthologous sequences from the dog, human and mouse. This table contains the 914 orthologous sequences (844 known and 70 novel) and the corresponding ensembl gene, transcript and protein identifiers for the dog, human and mouse orthologs. Status (known denotes sequences that map to public feline sequences and novel denotes sequences with no identity to public feline sequences). Symbol (gene symbol). Title (gene title). Sequence Identifier (unique identifier for each non-redundant feline sequence). Cat Gene Id (cat ensembl gene identifier for sequences designated as known). Cat Transcript Id (cat ensembl transcript identifier for sequences designated as known). Cat Protein Id (cat ensembl protein identifier for subset of sequences designated as known). Dog Gene Id (ensembl gene identifier of dog ortholog). Dog Transcript Id (ensembl transcript identifier of dog ortholog). Dog Protein Id (ensembl protein identifier of dog ortholog). Human Gene Id (ensembl gene identifier of human ortholog). Human Transcript Id (ensembl transcript identifier of human ortholog). Human Protein Id (ensembl protein identifier of human ortholog). Mouse Gene Id (ensembl gene identifier of mouse ortholog). Mouse Transcript Id (ensembl transcript identifier of mouse ortholog). Mouse Protein Id (ensembl protein identifier of mouse ortholog).
Additional file 3: Feline Genes Mapped to OMIM Diseases. This table contains a set of 104 feline cDNA sequences that were mapped to their corresponding human orthologs and the associated OMIM diseases. The first column indicates the cDNA identifier, the second column contains the ensembl human gene identifier for the orthologous human gene, and the third and fourth columns contain the OMIM disease identifier and the disease name respectively. Sequence Identifier (unique identifier for each non-redundant feline sequence). Human Gene Id (ensembl gene identifier of human ortholog). OMIM Identifier (OMIM id for a specific human disease). Disease Name (the name of the disease from the OMIM database).