First survey and functional annotation of prohormone and convertase genes in the pig

Background: The pig is a biomedical model to study human and livestock traits. Many of these traits are controlled by neuropeptides that result from the cleavage of prohormones by prohormone convertases. Only 45 prohormones have been confirmed in the pig. Sequence homology can be ineffective to annotate prohormone genes in sequenced species like the pig due to the multifactorial nature of prohormone processing. The goal of this study is to undertake the first complete survey of prohormone and prohormone convertase genes in the pig genome. These genes were functionally annotated based on 35 gene expression microarray experiments. The cleavage sites of prohormone sequences into potentially active neuropeptides were predicted.

Results: We identified 95 unique prohormone genes, 2 alternative calcitonin-related sequences, 8 prohormone convertases and 1 cleavage facilitator in the pig genome 10.2 assembly and trace archives. Of these, 11 pig prohormone genes have not been reported in the UniProt, UniGene or Gene databases. These genes are intermedin, cortistatin, insulin-like 5, orexigenic neuropeptide QRFP, prokineticin 2, prolactin-releasing peptide, parathyroid hormone 2, urocortin, urocortin 2, urocortin 3, and urotensin 2-related peptide. In addition, a novel neuropeptide S was identified in the pig genome, correcting the previously reported pig sequence that is identical to the rabbit sequence. Most differentially expressed prohormone genes were under-expressed in pigs experiencing immune challenge relative to the un-challenged controls, in non-pregnant relative to pregnant sows, in old relative to young embryos, and in non-neural relative to neural tissues. The cleavage prediction based on human sequences had the best performance, with a correct classification rate of cleaved and non-cleaved sites of 92%, suggesting that the processing of prohormones in pigs is similar to that in humans. The cleavage prediction models did not find conclusive evidence supporting the production of the bioactive neuropeptides urocortin 2, urocortin 3, torsin family 2 member A, tachykinin 4, islet amyloid polypeptide, and calcitonin receptor-stimulating peptide 2 in the pig.

Conclusions: The present genomic and functional characterization supports the use of the pig as an effective animal model to gain a deeper understanding of prohormones, prohormone convertases and neuropeptides in biomedical and agricultural research.


Background
In addition to its importance in livestock production, the pig is a well-established biomedical model to study human health due to its genomic, anatomic and physiologic similarities with humans. A wide variety of human health traits including cancer, reproductive health, drug metabolism, wound healing, and cardiovascular disease have been successfully studied using the pig [1][2][3][4]. Underlying these and other important traits such as growth and development, feeding, stress, memory and susceptibility to substance abuse are neuropeptides, a class of cell-cell signaling peptides that have paracrine, endocrine, and autocrine effects [5,6]. Research on selected pig neuropeptides has offered insights into cell transplantation, nervous system diseases, and injury [7]. For example, Yang et al. [8] reported on the inhibitory effect of neuromedin S (NMS) on luteinizing hormone secretion, which is mediated via melanocyte-stimulating hormone (MSH) neurons in the hypothalamus-pituitary axis of ovariectomized pigs. Kaminski et al. [9] concluded that the hypothalamic peptides orexin A and orexin B were involved in the control of food intake, sleep patterns, autonomic and neuroendocrine systems, and may also affect reproductive functions through their influence on the hypothalamic-pituitary-ovarian axis in pigs.
The identification of neuropeptides is more difficult than most proteins because neuropeptides are derived from larger prohormone proteins as a result of complex enzymatic processing. The conversion of the large prohormone to one or multiple smaller neuropeptides involves cleavage by multiple prohormone convertases and additional post-translational modifications such as amidation and glycosylation [5]. This complex processing of prohormones into neuropeptides challenges the identification of neuropeptide genes across genomes solely based on sequence homology to better understood species [5,6,10].
Only 40 prohormone and 2 prohormone convertase genes have been empirically confirmed in the pig compared to approximately 100 genes identified in human, rat, mouse, cow and rhesus monkey [11][12][13]. This partial knowledge of the prohormones, prohormone convertases and associated neuropeptides in the pig is a critical shortcoming, especially considering the important role of the pig in biomedical and agricultural research [1]. In addition, few gene expression studies have discussed the expression profile of pig prohormone or prohormone convertase genes. Ross et al. [14] found that estradiol treatment at day 9 of gestation was associated with changes in the expression of the prohormone neuromedin B (NMB) in the endometrium of gilts. Hausman et al. [15] concluded that the expression of neuropeptide Y (NPY) was down-regulated with age in gilts ranging from 90 to 210 days old.
Understanding the role of neuropeptides in human and livestock traits using the pig as biomedical model requires a comprehensive knowledge of the neuropeptide complement in the recently released pig genome (SScrofa 10.2v18, [16]). This understanding includes the identification of prohormone and prohormone convertase genes, prediction of cleavage sites in prohormones that may result in potentially bioactive neuropeptides, and characterization of gene expression and protein abundance across conditions to gain insights into the role of neuropeptides. A complete survey of the prohormone and prohormone convertase genes in the pig supports the interpretation of gene expression experiments and improves the effectiveness of tandem mass spectrometry studies to identify neuropeptides [17][18][19]. Functional annotation of these genes can be obtained by the analysis of the large number of gene expression experiments already available [20,21]. To address the lack of a comprehensive understanding of the prohormone and prohormone convertase genes in the pig, we present the first comprehensive survey and functional annotation of these genes. An all-inclusive catalogue of prohormone and prohormone convertase genes known in other species was used to search complementary pig genome databases. These genes were then characterized by analyzing a large number of gene expression experiments across a wide range of conditions. The potential cleavage sites of prohormones that can result in bioactive neuropeptides were predicted and compared to the cleavages based on known neuropeptide sequences.

Results and discussion
Pig prohormone genes
A comprehensive catalogue of 95 potential pig prohormone genes, 8 prohormone convertase genes and 1 prohormone convertase facilitator gene (7B2) was identified in the pig genome. Table 1 lists the genes and the corresponding BLAST matches on the pig Genome, UniProt, Gene and UniGene databases. The catalogue includes 11 previously unreported (i.e. without empirical confirmation) prohormone genes in the pig and complete sequences where only partial or incomplete sequences have been previously reported. Newly identified genes are intermedin (ADM2), cortistatin (CORT), insulin-like 5 (INSL5), orexigenic neuropeptide QRFP (OX26), prokineticin 2 (PROK2), prolactin-releasing peptide (PRRP), parathyroid hormone 2 (TIP39), urocortin (UCN1), urocortin 2 (UCN2), urocortin 3 (UCN3), and urotensin 2-related peptide (UTS2B). This search also identified two different calcitonin protein entries in public databases that are isoforms of other calcitonin genes. Additional information on the comprehensive catalogue of genes is available [see Additional file 1]. The predicted prohormone protein sequences with identified cleavage sites are provided in the NeuroPred format [see Additional file 2]. Table 2 summarizes the evidence from various repositories supporting the 95 unique pig prohormone genes and 2 alternative sequences detected in this study. The genome-predicted prohormone genes are supported by 66 UniProt entries (Table 1) including 47 sequences that have evidence at the protein level, of which 39 have a complete protein sequence and 8 have a partial protein sequence. Additionally, 81, 91 and 19 of the prohormone genes detected are supported by transcript evidence from UniGene, Gene, and UniProt, respectively. Two genes, apelin (APEL) and PROK2, are supported by evidence in UniProt and by EST records unassigned to a UniGene cluster. 
UniProt supporting information includes 2 entries from alternatively spliced calcitonin genes, preprocalcitonin gene-related peptide and calcitonin-2, that have evidence at the protein and transcript levels, respectively. However, calcitonin-2 is not assigned to a UniGene cluster or NCBI Gene identifier because no EST matches the unique region of the reported sequence. The UniProt entry for prothyroliberin (TRH) refers only to the 3-amino acid thyroliberin peptide, and no pig EST has been reported. Only 8 genome predictions lack supporting evidence in UniProt, and 5 of these have supporting transcript evidence. Complete protein sequences were predicted for the insulin-like 5 (INSL5) and endothelin-2 (EDN2) genes and an incomplete protein sequence was predicted for the prolactin-releasing peptide (PRRP) gene. The nucleic acid and protein sequences of EDN2 have been reported [22] but this information is not present in public databases. Of the 23 UniProt predicted sequences, 14 entries have transcript support and 7 do not. There are 2 UniProt predicted sequences that correspond to genes that are absent from the pig genome assembly studied. A partial match to UCN1 was found in the trace archives, although this gene was not found in the pig genome assembly studied and lacks external validation. Progonadoliberin-2 (GON2) was not found in the current pig genome assembly, although this gene was detected in earlier assembly versions and trace archives. This gene has no current experimental evidence beyond homology to other mammalian species. The apparent lack of UCN1 and GON2 in the assembly, and the fragmentary evidence for other prohormones, is most likely due to poor coverage of the genomic regions where these prohormone genes are located. At least four calcitonin genes, also known as calcitonin receptor-stimulating peptide genes, were identified, with 2 genes exhibiting alternative splicing. 
The genome assembly permitted the assignment of the UniProt preprocalcitonin (A6P7L6) and preprocalcitonin gene-related peptide (A6P7L7) entries to the same gene that also produces the UniProt calcitonin (CALC, [Swiss-Prot:P01259]) and calcitonin gene-related peptide (CALCA, [Swiss-Prot:P30880]) peptides, respectively. This alternatively spliced gene corresponds to the CALCA gene found in other mammalian species.
The bioinformatics gene prediction pipeline confirmed that the separate Q766Y6 and A0A761 UniProt entries are alternatively spliced variants of the same calcitonin receptor-stimulating peptide 3 (CRSP3) gene, as initially reported by Rezaeian et al. [23]. A single genome match was identified for calcitonin receptor-stimulating peptide 2 (CRSP2, Q766Y7). While the calcitonin receptor-stimulating peptide 1 (CRSP1) gene has been reported, the actual gene may be inaccurately assembled in the 10.2 genome release because the region appears to contain a small duplication leading to 2 starting locations. Further supporting this argument, a conserved 19 amino acid region found in all calcitonin receptor-stimulating peptide-related protein sequences matched an additional 5th genome site that was part of a discontinued NCBI Gene entry (Gene ID 100624618). There is insufficient information to conclude whether there is a separate coding gene involved or an assembly-related problem.
Neuropeptide S (NPS) is a potential 12th prohormone gene discovered by the bioinformatics gene prediction pipeline. Two genome matches on different chromosomes were identified for the NPS gene. However, the complete identity of the match between the predicted sequence and both chromosomal locations implied that this was an assembly error rather than a duplication event. Although UniProt has a partial pig NPS entry (B5M997), the genome-predicted protein sequence was more similar to the bovine sequence, as expected, than the present partial UniProt pig sequence. The UniProt pig NPS protein and nucleic acid sequences were found to be 100% and 99% identical, respectively, to the rabbit sequence ([GenBank:EU978456]). The similarity between the UniProt pig and rabbit sequences was also evident in the phylogenetic relationships among NPS sequences reported by Yao et al. [24]. These findings call into question the present pig NPS entry in UniProt.

Prohormone genes previously unreported in pig
The genome search identified 11 prohormone genes that do not have empirical confirmation in the UniProt, UniGene or Gene databases (Table 1). These genes are ADM2, CORT, INSL5, OX26, PROK2, PRRP, TIP39, UCN1, UCN2, UCN3, and UTS2B. NPS is discussed alongside them as a potential 12th gene because the current pig NPS entry is identical to the rabbit NPS. Only inferred sequences are available for ADM2 and CORT in UniProt. There is evidence for mammalian homologs of all these genes in UniProt. The protocol followed to identify these genes required a high percentage of identities and similarities, a minimal percentage of mismatches and gaps, and conservation of the region encompassing the potential neuropeptide. Intermedin or adrenomedullin 2 (ADM2) is part of the calcitonin family of peptides and has effects similar to those of adrenomedullin (ADML). In humans, intermedin causes hypotension when given peripherally, and augments blood pressure and causes sympathetic activation when given to the central nervous system [25]. This neuropeptide induces prolactin release, has antidiuretic and natriuretic properties and reduces food intake. The amino acid sequences of CORT and somatostatin (SMS) are highly similar and both reduce neuronal activity. In addition, CORT has unique roles such as induction of slow-wave sleep, reduction of locomotor activity, and activation of cation-selective currents not responsive to SMS [26]. Although the function of INSL5 is still being determined, high expression in the colon, as well as in the brain and hypothalamus, indicates roles in gut contractility and neuroendocrine signaling [27]. Likewise, the function of OX26 is still being elucidated, although studies in chicken confirm the orexigenic, appetite-stimulating activity of this neuropeptide [28]. Takayanagi and Onaka [29] demonstrated that PRRP plays a role in the control of energy metabolism and stress response. 
Prokineticins are involved in tumorigenesis (prostate, testicles, neuroblastoma, colon, and pancreas), acting as a growth factor for cancer cells and as an angiogenic and chemotactic factor for pro-inflammatory neutrophils [30]. NPS has anxiolytic-like effects (stress reduction) and can induce arousal and wakefulness [31]. TIP39 and the corresponding receptor form a neuromodulator system and the anatomical distribution indicates a role in limbic, endocrine, viscerosensory, and auditory functions. This system has been postulated as a potential drug target in anxiety, depression and chronic pain management [32]. Urocortins and their receptors have been found in the central nervous, digestive, reproductive, cardiovascular, immune and endocrine systems, suggesting a variety of roles including cardiovascular activity and cell survival [33]. UTS2B is a paralog of urotensin 2 (UTS2) that exerts similar biological effects including relaxation of muscles and reduction of blood pressure [34].

Pig prohormone convertase genes
The sequences of 8 prohormone convertase genes and of the 7B2 facilitator gene, also known as secretogranin 5 (SCG5), were identified in the pig genome (Table 1). The UniProt and Gene databases only had supporting evidence for PCSK1, PCSK2, and 7B2. Six additional prohormone convertase genes (furin, PCSK4, PCSK5, PCSK6, PCSK7, and PCSK9) that were previously unreported or not based on empirical evidence were identified. Only transcript evidence supports the 8 prohormone convertase genes, whereas protein evidence is available for 7B2. Dai et al. [35] isolated PCSK1 from the ovary cDNA library of a pregnant sow and Renegar et al. [36] detected PCSK1 in the corpus luteum and brain of pregnant sows. Also, mRNA from PCSK1 and PCSK2 has been identified in the pituitary neurointermediate lobes of pigs [37]. Among the prohormone convertases, furin, PCSK4, PCSK5, PCSK6 and PCSK7 do not have UniGene entries. The present catalogue extends the currently limited work on pig prohormone convertases.
Functional characterization of the pig prohormone and prohormone convertase genes
Analysis of the large number of microarray gene expression experiments enabled the first comprehensive characterization of the role of prohormone and prohormone convertase genes in biological processes in the pig. The results from these analyses augmented the understanding of the role of these genes in reproduction, health, growth, and other traits of importance to biomedical research and agricultural production. The query of the Affymetrix Porcine Genome Array identified 77 probes representing 56 prohormone and 3 prohormone convertase genes. Table 3 lists the total number of differentially expressed probes (P-value < 0.005) within the seven experimental classes considered. A detailed distribution of the differential expression level of each probe and experiment is provided [see Additional file 3]. A discussion of the findings for the 3 groups with the highest number of differentially expressed probes (immune-related, embryo and placenta, and brain and central nervous system) is presented below. Although neuropeptides expressed in the brain and the immune system interact with circulating cytokines to support two-way communication between the brain and immune system [38], we describe the profiles of prohormones in immune-related tissues separately from the brain and central nervous system tissues to facilitate the interpretation of results.
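The probe-level tally summarized in Table 3 can be illustrated with a minimal sketch. It assumes that each contrast is summarized by a P-value and a fold change; the record layout, the function names and the example values are hypothetical and are not taken from the Additional files or from the actual analysis pipeline.

```python
from collections import Counter

P_THRESHOLD = 0.005  # significance cut-off quoted for Table 3


def count_differential_probes(results, p_threshold=P_THRESHOLD):
    """Count distinct probes per experimental class with at least one
    contrast whose P-value falls below the threshold."""
    significant = {
        (r["tissue_class"], r["probe"])
        for r in results
        if r["p_value"] < p_threshold
    }
    return Counter(tissue_class for tissue_class, _ in significant)


def direction(fold_change):
    """Interpret a fold change: below 1 means under-expression of the
    first group of the contrast relative to the second."""
    if fold_change < 1:
        return "under"
    if fold_change > 1:
        return "over"
    return "unchanged"


# Hypothetical records, for illustration only.
example = [
    {"tissue_class": "immune", "probe": "probe_A", "p_value": 0.0012, "fold_change": 0.33},
    {"tissue_class": "immune", "probe": "probe_A", "p_value": 0.0300, "fold_change": 0.90},
    {"tissue_class": "immune", "probe": "probe_B", "p_value": 0.0001, "fold_change": 0.62},
    {"tissue_class": "brain", "probe": "probe_C", "p_value": 0.0200, "fold_change": 7.94},
]
counts = count_differential_probes(example)
```

Under this convention a fold change of 0.33 denotes under-expression, which is how the fold changes reported in the following sections should be read.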

Immune-related profiling
Several studies have demonstrated that prohormone genes play an important role in the pig immune response [39]. This was evidenced by the high number of differentially expressed prohormone and prohormone convertase genes (24 genes) among experiments that evaluated the immune response in blood, spleen, and lymph nodes (Table 3). In general, prohormone genes were under-expressed in pigs under immune challenge relative to the unchallenged controls. AUGN was differentially expressed in two experiments: GSE7313 [40], which profiled lymph nodes, and GSE14790 [41], which profiled blood. In GSE14790, 7-day-old pigs were inoculated with porcine circovirus type 2 (PCV2), a virus that is widely spread across pig farms, and gene expression was profiled at 0, 7, 14, 21 and 29 dpi. AUGN was over-expressed in un-inoculated pigs at 29 dpi relative to pigs at 7 dpi, regardless of inoculation, and relative to inoculated pigs at 21 dpi (P-value < 2.5 × 10^-4). Both contrasts indicate that the expression of AUGN increases with age and that this trend is slower in pigs infected with PCV2. In GSE7313, the gene expression of seven-week-old piglets inoculated with Salmonella Typhimurium was profiled at 8 hours post inoculation (hpi), 24 hpi, 48 hpi, and 21 days post inoculation (dpi). AUGN was over-expressed at 21 dpi relative to 24 hpi and 48 hpi (P-values < 6.8 × 10^-5 and 2.7 × 10^-6, respectively). Consistent with the differential expression in relation to immune response observed in this study, AUGN is a putative tumor suppressor gene and is down-regulated in many cancers [42].
SCG1, SCG2 and SCG3, members of the secretogranin family, exhibited differential expression among immune-challenge experiments, consistent with the known association of these genes with cell activation, cytotoxicity and microbial defense [45]. Probes for all 3 SCGs exhibited differential expression in two immune-related experiments. SCG1 and SCG2 were differentially expressed in GSE14790 while SCG3 was differentially expressed in GSE11787 [21]. In GSE11787, SCG3 was under-expressed in inoculated pigs relative to controls (P-value < 1.2 × 10^-3, fold change = 0.33). These results are consistent with the lack of synthesis of endogenous granins in rat PC12 cells infected with recombinant vaccinia viruses [46]. In GSE14790, SCG1 and SCG2 were under-expressed in pigs inoculated with PCV2 relative to un-inoculated control pigs (P-value < 1.8 × 10^-4, fold change = 0.88 and P-value < 1.5 × 10^-5, fold change = 0.93, respectively).
Two members of the vascular endothelial growth factor family, VEGFC and VEGFD, were under-expressed in PCV2-inoculated pigs relative to control pigs (P-value < 1.8 × 10^-5, fold change = 0.62) and also under-expressed at early stages (7 dpi) relative to later stages (19 and 29 dpi) in GSE14790. In agreement with these findings, a loss of endothelial growth factor transcription and an increase in pro-inflammatory indicators were reported in the endometrial lymphocytes of pigs at sites of fetal arrest [47].
NPY (probe Ssc.15981.1.A1_at) was under-expressed in PCV2-inoculated pigs relative to control pigs (P-value < 6.6 × 10^-4) and, within infection level, NPY was under-expressed at earlier stages relative to 29 dpi in GSE14790. Consistent with these findings, the levels of NPY mRNA decreased in the blood of rats treated with vinblastine, an anti-cancer drug known to decrease the number of white blood cells of the immune system involved in defense [48]. Similarly, NPY was found to decrease in cattle infected with Bovine Spongiform Encephalopathy [49]. ADML was differentially expressed in GSE14758-D and GSE7314. In GSE14758-D [41], ADML was under-expressed in the mediastinal lymph nodes of PCV2-infected pigs relative to control pigs at 29 dpi (P-value < 1.7 × 10^-3, fold change = 0.6). In contrast, in GSE7314, ADML was over-expressed at 48 dpi in the blood of pigs inoculated with Salmonella choleraesuis relative to controls (P-value 4.8 × 10^-3). The latter result is consistent with the up-regulation of ADML gene expression and increases in systemic circulatory concentrations of ADML in response to the onset and progression of trauma, infection, and sepsis [50]. The apparent inconsistency between both experiments may be associated with the differential effects that ADML has on cellular metabolism, immune function, endocrine function, and cardiovascular function. Of the 3 prohormone convertases available in the microarray platform, PCSK1 and PCSK7 were significantly differentially expressed (P-value < 1.3 × 10^-3) and PCSK2 was marginally significantly differentially expressed (P-value < 6.5 × 10^-3) in GSE14790. PCSK1 was under-expressed in PCV2-inoculated pigs already at 7 dpi relative to 29 dpi, regardless of inoculation at the later stage (P-value < 5.7 × 10^-5). 
Likewise, PCSK7 was under-expressed in PCV2-inoculated pigs relative to controls already at 7 dpi (P-value < 4.2 × 10^-4) and, within controls, PCSK7 was under-expressed at early stages (7 dpi, 21 dpi) relative to 29 dpi (P-value < 4.1 × 10^-4). These results are in agreement with similar findings that furin, another prohormone convertase, was dysregulated in the immune cells of advanced human atherosclerotic plaques [51] and imply that prohormone convertase genes, like prohormone genes, are down-regulated under immune challenges.
Both PENK probes were over-expressed in tubular and spherical conceptuses relative to filamentous conceptuses (P-value < 2.6 × 10^-6) in experiment GSE12705 [20]. This is consistent with results that found PENK mRNA to increase linearly during gestation in the hippocampus of pigs [56]. PTHR was under-expressed in tubular and spherical relative to filamentous conceptuses (P-value < 6.4 × 10^-7, fold change = 0.02) in experiment GSE12705. This finding is supported by reports that PTHR is present in higher concentrations in fetal pigs than in sows [56]. VEGFC is a representative of the vascular endothelial growth factor family of prohormones that have an important role in the survival and mitogenesis of endothelial cells and in the lymphangiogenesis and angiogenesis of embryos [57]. VEGFC was over-expressed in pregnant sows relative to non-pregnant sows (P-value < 7.8 × 10^-4) in experiment GSE18641 [52]. This finding is supported by a study in the chicken demonstrating that growth of embryonic microvessels in the chorioallantoic membrane (analogous to the placenta in mammals) was stimulated by VEGFC [58]. The expression profile is also supported by the finding that, in mouse embryos, VEGFC is required for successful lymphatic vasculature development and lymphatic endothelial cell migration [59].
ADML was over-expressed in the immortalized porcine olfactory bulb neuroblasts relative to the non-neural epithelial cells (P-value < 2.2 × 10^-6, fold change > 10) in experiment GSE16855 [60]. This result is supported by a previous study using the mouse olfactory bulb that found that ADML is important for the regulation of proliferation and differentiation of neural stem/progenitor cells [61].
IGF1 was over-expressed in the neuroblasts relative to non-neural epithelial cells (average P-value < 5 × 10^-7, fold change > 10) in experiment GSE16855. This result is supported by a study in chickens showing that IGF1 was expressed in the olfactory bulb [62]. Also, IGF2 (probe Ssc.9365.6.S1_x_at) was consistently over-expressed in the hypothalamus of male Iberian pigs relative to all other seven breed-gender combinations (on average, P-value < 2.3 × 10^-4, fold change = 2.42) in experiment GSE14739-H [63,64]. NPY was over-expressed (P-value < 8.1 × 10^-4, fold change = 7.94) in neuroblasts relative to non-neuronal cells in GSE16855. This result is consistent with reports that the olfactory bulb exhibits high levels of immunoreactive NPY in the brain of pigs [65] and that NPY may inhibit excitatory neurotransmission in the rat olfactory bulb [66]. VEGFC was over-expressed in neuroblasts relative to non-neuronal cells (P-value < 1.5 × 10^-9, fold change > 10) in experiment GSE16855. This result agrees with the 30% increase in dividing neuroblasts in olfactory bulb cultures stimulated with VEGFC compared to controls reported by Le Bras et al. [67]. PTHLH was under-expressed in neuroblasts relative to non-neuronal cells (P-value < 2.6 × 10^-4, fold change = 0.20) in GSE16855. This finding is consistent with reports that PTHLH may be a negative regulator in the differentiation of chondrocytes [68]. PDGFA was over-expressed in neuroblasts relative to non-neuronal cells (P-value < 1.2 × 10^-4) in experiment GSE16855. Related to this result, Fressinaud et al. [69] reported that platelet-derived growth factors increase the glutamine synthetase activity in astrocytes in the brain.

Prediction of cleavage sites in pig prohormones
All 97 prohormone sequences were used to predict cleavage and to compare the predictions against known or predicted cleavage sites. These sequences were inferred to have 228 cleavage sites, resulting in a 14.6% prevalence rate (proportion of possible sites that are cleaved). Most sites were cleaved at an arginine (R) such that the most frequently cleaved motifs were xxKR (71%), RxxR (34%) and xxRR (41%), where x denotes any amino acid and K denotes lysine. There were 5% (38) C-terminal single R sites that were cleaved without a basic amino acid in the second and fourth positions preceding the cleavage site (P2 or P4 locations, respectively).
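The three motifs named above can be located in a protein sequence with a short scan. The following is a simplified sketch in the spirit of a motif-based search and not the NeuroPred implementation; the function name and the example sequence are hypothetical, and cleavage is assumed to occur immediately after the final basic residue of the motif.

```python
import re

# Motif name -> (zero-width lookahead pattern, motif length).
# Lookaheads are used so that overlapping occurrences are all reported.
MOTIFS = {
    "xxKR": (re.compile(r"(?=KR)"), 2),
    "xxRR": (re.compile(r"(?=RR)"), 2),
    "RxxR": (re.compile(r"(?=R..R)"), 4),
}


def find_candidate_sites(sequence):
    """Return {position: [motif names]} where position is the number of
    residues that precede the candidate cleavage point."""
    sites = {}
    for name, (pattern, length) in MOTIFS.items():
        for match in pattern.finditer(sequence):
            position = match.start() + length
            sites.setdefault(position, []).append(name)
    return dict(sorted(sites.items()))


# Hypothetical sequence, for illustration only.
candidates = find_candidate_sites("MAGKRSPLRAARTE")
```

A scan of this kind flags every occurrence of a motif, whether or not the site is actually cleaved, which is why motif matching alone tends to produce many false positive predictions.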
The performance of the cleavage prediction models is presented in Table 4. The correct classification rate ranged from 82% to 92%, indicating that a large proportion of the sites were accurately predicted across all models. The human cleavage prediction models had the best performance for most of the statistics, followed by the mammalian model. The Known Motif model provided the highest number of true positive predictions and the highest sensitivity, 77%, indicating that more than three quarters of the cleaved sites were correctly predicted as cleaved. However, this model also provided the highest number of false positive predictions. Consequently, the Known Motif positive predictive power was 35%, indicating that, on average, only 35% of sites predicted to be cleaved are expected to be true cleavage sites.
The human models provided the highest number of true negatives, resulting in the best model performance compared to the Known Motif and mammalian models. The human artificial neural network models had approximately 60% positive predictive power, indicating that most sites predicted as cleaved are expected to be true positives. Although the human logistic models had lower sensitivity than their artificial neural network counterparts, the differences with the human artificial neural network model were only 4 cleaved and 11 non-cleaved sites. The high performance of the human models suggests that the cleavage of prohormones that results in potential biologically active neuropeptides in the pig is similar to that in humans. Noteworthy is that the mammalian model was trained on 51 mammalian prohormones that included 8 pig prohormones. This model provided slightly more true positive predictions and a higher sensitivity than the human logistic model. However, the mammalian model had noticeably more false positive predictions than the human logistic model, resulting in lower performance in the other accuracy measures.
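The accuracy measures discussed for Table 4 follow from the four cells of a confusion matrix of predicted versus known cleavage sites. The sketch below uses the standard definitions; the function name and the counts are hypothetical and do not reproduce the values in Table 4.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy measures from confusion-matrix counts.

    tp: cleaved sites predicted as cleaved
    fp: non-cleaved sites predicted as cleaved
    tn: non-cleaved sites predicted as non-cleaved
    fn: cleaved sites predicted as non-cleaved
    """
    total = tp + fp + tn + fn
    return {
        "correct_classification_rate": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_power": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=70, fp=30, tn=870, fn=30)
```

Because cleaved sites are a small minority of all candidate sites, a model can reach a high correct classification rate mostly through true negatives, while a modest number of false positives is enough to pull the positive predictive power well below the sensitivity.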
The comparison of results across models also provides information on the accuracy of the cleavage assignment, prediction accuracy and potential for a gene to produce bioactive peptides. For 10 prohormones, at least 5 of the models did not predict any cleavage site. However, it must be noted that 4 of the prohormones (ANF, GHRL, IGF1 and PDGFD) are likely to have sites cleaved by proteases other than prohormone convertases. For example, ANF is cleaved by corin, serine peptidase (CORIN) [70].
Genes with no predicted cleavage, or with assigned cleavages that differ from those in other species, can be used to identify proteins that are not cleaved to form smaller peptides. There is no evidence for cleavage of UCN2 and UCN3 to produce mature peptides in mammals [71]. All models failed to predict two cleavage sites in TOR2X. The first site, an N-terminal dibasic 'RK', is known to be rarely cleaved across species [72]. The second site is a cleavage found in humans that forms alpha- and beta-salusin, but this site may not be cleaved in the pig since the pig sequence, like the bovine sequence, only has a single basic site instead of the human dibasic 'RR' site. Similarly for TKN4, the genomic prediction and supporting EST data indicate a change from an R in other species to a glycine amino acid in the pig sequence that may prevent the formation of the 'Hemokinin' peptide.
The pig CRSP2 protein sequence lacks the 'KR' site and a C-terminal cleavage site that are cleaved in the human CALC and CALCB genes to produce Calcitonin gene-related peptide 1 and Calcitonin gene-related peptide 2. Therefore it is unlikely that pig CRSP2 would provide these calcitonin peptides. The assigned cleavages in the RES18 protein are necessary to provide a potential triskadecapeptide reported by Bloomquist et al. [73]. This potential peptide has flanking dibasic cleavage sites in the mouse and rat, but it has not been experimentally confirmed. The corresponding regions in the human, bovine and pig sequences are monobasic and lack common PC cleavage motifs, suggesting that these species probably cannot form this peptide.
Examination of the potential cleavage sites in IAPP indicated a mutation from R to Q in the N-terminal cleavage site that is necessary to produce the Islet amyloid polypeptide. Examination of the corresponding ESTs indicated that 2 swine ESTs ([GenBank:AJ649149] and [GenBank:AJ649469]) were 100% identical to the rat genome and were consequently considered invalid sequences. Two other ESTs ([GenBank:EW569366] and [GenBank:BF712755]) matched the region and supported the genomic prediction. The predicted protein sequence, including the potential cleavage sites of the expected IAPP, was less than 80% identical to other mammalian sequences, compared to typically over 85% identity between the human and most other mammalian sequences. Potter et al. [74] questioned the capability of IAPP to form amyloids after examining the functionality of a synthesized pig sequence based on the [GenBank:BF712755] EST sequence. The predicted prohormone sequence and cleavage prediction results also strongly suggest that the pig is unlikely to be able to form IAPP. This reflects the importance of proteomic studies involving cleavage to first determine that a species can produce a peptide.

Conclusions
The pig is an important biomedical and agricultural research species. Results from the first genome-wide study of pig prohormone and prohormone convertase genes, their functional annotation and the prediction of prohormone cleavage have been presented. This study was enabled by the availability of the pig genome sequence and of 35 gene expression experiments that evaluated a wide range of conditions in pigs. These results offer more insights into the role of neuropeptides in biological processes such as reproduction, development, growth, and health, and support targeted empirical confirmation. The bioinformatics pipeline used in this study can be used to identify prohormones or other sets of genes in species with similar sequence resources. Confirmatory insight into the pig prohormones can be expected from proteomic mass spectrometry studies.
Combining complementary bioinformatic resources, 95 prohormone genes, 8 prohormone convertases and one cleavage facilitator were discovered in the pig genome and raw sequence repositories. We uncovered 11 prohormone genes that have not been previously reported and one potentially incorrectly reported. The high performance of the models used to predict cleavage in the pig prohormones suggests that the prohormone cleavage in pigs is similar to humans. The analysis of 35 gene expression experiments identified various neuropeptide genes differentially expressed in immune-related tissues, embryo and placenta and the central nervous system including AUGN, IGF2, the family of SCGs, NPY, ADM and ADML, PENK, PTHR, and VEGFC. Experiments are required to confirm that the pig does not produce the bioactive neuropeptides UCN2, UCN3, TOR2X, TKN4, IAPP, and CRSP2 as suggested by the cleavage prediction models.
Candidate genes were searched for in the pig genome 10.2 assembly using the bioinformatics pipeline developed by Southey et al. [11,13]. The protein sequence of each candidate gene in the registry was searched against the pig genome assembly using the TBLASTN, BLASTP and BLASTN programs from NCBI BLAST (version 2.18) [81] with default parameter settings (E-value < 10 and BLOSUM62 scoring matrix) and filtering disabled. In addition, sequences not used in the pig genome assembly (including unassigned genomic regions, whole genome shotgun sequencing and trace archives) were searched when there was no suitable BLAST match to a candidate gene or when the alignment to the genome assembly suggested a missing genomic region. This strategy allowed the annotation of genomic regions that were partly or not included in the assembly. The BLAST matches were examined based on the alignment score and E-value to identify the most likely matches and genomic location of the corresponding prohormone. The identified pig genomic region that encompassed the BLAST match was further extended approximately 500 base pairs beyond the 5′ and 3′ ends of the match. Matches were also screened for alignments to multiple homologous prohormone genes that could indicate gene duplication events in the pig genome. The gene parsing tool Wise2 [82] was used to predict the protein sequence within the genome regions detected with BLAST. The genomic region was further extended when only a partial protein sequence was predicted. In this study, Wise2 compared the target protein (preference was given to pig protein sequences, followed by human, cattle and other mammals) to the pig genomic DNA sequence identified by BLAST to infer the gene structure based on a model that includes introns and frameshift errors. Each predicted gene was compared to the UniProt and NCBI Gene databases to assess the accuracy of the prediction based on previously reported pig genes.
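The extension of a BLAST match by approximately 500 base pairs on each side can be sketched as follows. This is a minimal illustration, not the pipeline of Southey et al.; the function name `extend_region`, the 1-based inclusive coordinate convention and the clamping to the contig boundaries are assumptions made for this example.

```python
def extend_region(start, end, contig_length, flank=500):
    """Extend a 1-based inclusive match [start, end] by `flank` bases on each side.

    Hypothetical helper: coordinates are clamped so the extended region
    never runs past the first or last base of the contig.
    """
    # Minus-strand hits are often reported with start > end; normalize first.
    if start > end:
        start, end = end, start
    new_start = max(1, start - flank)
    new_end = min(contig_length, end + flank)
    return new_start, new_end


if __name__ == "__main__":
    # A match at 1200..1800 on a 5000 bp contig.
    print(extend_region(1200, 1800, 5000))
```

When Wise2 returns only a partial protein for the extended region, the same helper could be called again with a larger `flank`, mirroring the further extension described above.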
To further confirm the Wise2 predictions, the protein sequence predicted from the gene model was also compared to the corresponding published mammalian sequences using the multiple sequence alignment tool ClustalW [83]. The multiple sequence alignment maximized the likelihood of identifying homologous genes. The predicted sequences were also searched against the pig entries in the NCBI EST database to confirm the presence of the predicted protein sequence. The pig entries in the NCBI EST database were also used to complete the protein sequence when the genome coverage was incomplete.
Functional annotation of the pig prohormone and convertase genes
A review of the pig microarray gene expression experiments available in the NCBI GEO database [84] indicated that the Affymetrix Porcine Genome Array GPL3533 [85] was the most commonly used platform. The UniGene database was searched for sequences that represent prohormone and prohormone convertase genes. This information was used to identify the probes representing prohormone and prohormone convertase genes in the Affymetrix Porcine Genome platform.
Thirty-five experiments that used the Affymetrix Porcine Genome platform were identified in GEO. Selected experiments had a minimum of 6 microarrays and a maximum of 80 microarrays. The sources and main features of these experiments are provided [see Additional file 4]. The wide range of selected microarray experiments available supported a comprehensive characterization of the association of prohormone and associated neuropeptide and convertase genes with various biological processes.
The gene expression data were pre-processed and normalized using the Affy R package [86]. Steps included the log-2 transformation and GC-robust multichip average normalization of the gene expression measurements. All probes in the platform were analyzed using ANOVA to identify those that exhibited differential expression across the conditions studied. The false discovery rate [87] approach was used to adjust the statistical significance of the differential expression and account for multiple testing across all probes. The normalization, one- or two-way ANOVA and multiple test adjustment of the results were done using Beehive [88].
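As an illustration of the multiple-testing adjustment, the sketch below computes false discovery rate adjusted P-values across probes. It assumes the Benjamini-Hochberg step-up procedure, which is a common false discovery rate approach; the text does not state which variant was applied in Beehive, and the function name `benjamini_hochberg` is introduced here only for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original probe order.

    Assumption: this is one standard FDR procedure, not necessarily the
    exact adjustment applied in the study.
    """
    m = len(p_values)
    # Probe indices sorted by ascending raw P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # made-up P-values for four probes
    print(benjamini_hochberg(raw))
```

A probe would then be declared differentially expressed when its adjusted P-value falls below the chosen threshold.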

Prediction of cleavage sites
The location of the cleavage in pig prohormone proteins that would result in potentially active neuropeptides was predicted using NeuroPred [72]. Complete prohormone sequences from UniProt were used to predict cleavage in preference to the predicted sequences. In limited cases, EST sequences were combined with the genomic data and published partial sequences to predict the complete prohormone sequence. For example, for Chromogranin-A (CMGA), three glutamic acids that were missing in the genome-based predictions were present in the corresponding UniProt fragment sequence ([Swiss-Prot:P04404]) and EST sequence ([GenBank:EW261315]), which permitted the prediction of the complete pig CMGA protein sequence. The locations of the potential cleavage sites in the pig prohormones were inferred by homology to human data.
Complementary cleavage prediction models trained on confirmed cleavages from mammalian sequences [12,78,79] were used to predict cleavages in the pig prohormone sequences. These models included the known motif model, which searches for sites with specific combinations of basic amino acids associated with cleavages reported in other species [78]; mammalian logistic regression [79]; and human logistic regression and artificial neural network models based on amino acids only or on amino acids combined with the physicochemical properties of amino acids [12].
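The idea behind a motif-based search can be illustrated with a deliberately simplified scan for dibasic sites. The motif set used here (KK, KR, RK and RR) is an assumption chosen for illustration only; it is not the set of combinations defined by the known motif model in [78], and the function name `candidate_dibasic_sites` is hypothetical.

```python
import re

# Simplified, assumed motif set: any two consecutive basic residues (K or R).
# The lookahead makes overlapping motifs (e.g. "KRR") all get reported.
DIBASIC = re.compile(r"(?=([KR][KR]))")


def candidate_dibasic_sites(sequence):
    """Return (position, motif) pairs for each dibasic motif.

    `position` is the 1-based index of the second basic residue, i.e. the
    residue after which cleavage would be expected to occur.
    """
    seq = sequence.upper()
    return [(m.start() + 2, m.group(1)) for m in DIBASIC.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence, not a real prohormone.
    print(candidate_dibasic_sites("AAKRGGSRRA"))
```

Such a scan only proposes candidate sites; the logistic regression and neural network models additionally weigh the neighboring residues when predicting whether a candidate site is cleaved.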
Known or predicted cleavage sites on all 97 prohormone sequences were used to assess the performance of the models to predict cleavage. The "observed" cleavage sites, known or inferred from homology to other species based on a literature search [11-13,19,78-80], were compared to the cleavage sites predicted by the models. The counts of the true positives (number of correctly predicted cleaved sites), true negatives (number of correctly predicted non-cleaved sites), false positives (number of incorrectly predicted cleaved sites) and false negatives (number of incorrectly predicted non-cleaved sites), or functions of the counts, were used to assess the model performance. These counts were used to compute the correct classification rate (number of correctly predicted sites divided by the total number of all sites), sensitivity (number of true positives divided by the total number of cleaved sites), specificity (number of true negatives divided by the total number of non-cleaved sites), positive predictive power (number of true positives divided by the total number of sites predicted to be cleaved), negative predictive power (number of true negatives divided by the total number of sites predicted to not be cleaved), and the Matthews correlation coefficient between observed and predicted cleavage. The area under the receiver operating characteristic (ROC) curve relating sensitivity and 1 - specificity [78] was also calculated, where area values lower than 0.7 indicate poor model performance.
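The count-based measures defined above can be written out as a short sketch. The function name `classification_metrics` and the example counts are hypothetical and do not correspond to results from this study; the Matthews correlation coefficient uses its standard formula.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the performance measures from the four confusion-matrix counts."""

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "correct_classification_rate": ratio(tp + tn, total),
        "sensitivity": ratio(tp, tp + fn),   # cleaved sites recovered
        "specificity": ratio(tn, tn + fp),   # non-cleaved sites recovered
        "positive_predictive_power": ratio(tp, tp + fp),
        "negative_predictive_power": ratio(tn, tn + fn),
        "matthews_correlation": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    for name, value in classification_metrics(tp=40, tn=50, fp=5, fn=5).items():
        print(f"{name}: {value:.3f}")
```

The area under the ROC curve is not included because it requires the predicted probabilities for each site rather than the four counts alone.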