Discovery and characterization of 91 novel transcripts expressed in cattle placenta
BMC Genomics volume 8, Article number: 113 (2007)
Among the eutherian mammals, placental architecture varies to a greater extent than any other tissue. The diversity of placental types, even within a single mammalian order suggests that genes expressed in placenta are under strong Darwinian selection. Thus, the ruminant placenta may be a rich source of genes to explore adaptive evolutionary responses in mammals. The aim of our study was to identify novel transcripts expressed in ruminant placenta, and to characterize them with respect to their expression patterns, organization of coding sequences in the genome, and potential functions.
A combination of bioinformatics, comparative genomics and transcript profiling was used to identify and characterize 91 novel transcripts (NTs) represented in a cattle placenta cDNA library. These NTs have no significant similarity to any non-ferungulate DNA or RNA sequence. Proteins longer than 100 aa were predicted for 29 NTs, and 21 are candidate non-coding RNAs. Eighty-six NTs were found to be expressed in one or more of 18 different tissues, with 39 (42%) showing tissue-preference, including six that were expressed exclusively in placentome. The authenticity of the NTs was confirmed by their alignment to cattle genome sequence, 42 of which showed evidence of mRNA splicing. Analysis of the genomic context where NT genes reside revealed 61 to be in intergenic regions, whereas 30 are within introns of known genes. The genes encoding the NTs were found to be significantly associated with subtelomeric regions.
The 91 lineage-specific transcripts are a useful resource for studying adaptive evolutionary responses of the ruminant placenta. The presence of so many genes encoding NTs in cattle but not primates or rodents suggests that gene loss and gain are important mechanisms of genome evolution in mammals. Furthermore, the clustering of NT genes within subtelomeric regions suggests that such regions are highly dynamic and may foster the birth of novel genes. The sequencing of additional vertebrate genomes with defined phylogenetic relationships will permit the search for lineage-specific genes to take on a more evolutionary context that is required to understand their origins and functions.
The primary function of the placenta is to regulate the transport of gases, nutrients and waste products between mother and fetus. The placenta also serves as an endocrine organ, producing estrogens, progesterone and placental lactogens that are important for the maintenance of pregnancy. While these functions have been conserved in all eutherian mammals, the relatively large variation in placental architecture makes the placenta an attractive model system for studying adaptive evolutionary changes. Placentae are classified on the basis of their gross shape and the distribution of contact points between the fetal tissues and the maternal endometrium. The synepitheliochorial cotyledonary placenta of ruminants has three distinct tissue layers and is regarded as the most complex of the placental types. By contrast, in the discoid hemochorial placenta of primates and rodents, the fetal chorionic epithelium is directly bathed in maternal blood because the three maternal tissue layers are degraded. The molecular basis for these anatomical changes and their adaptive significance remain largely unknown.
Recent studies demonstrating that phenotypic changes with adaptive significance can be caused by the action of individual genes provide an important rationale for the identification of genes that may be highly divergent or unique to a specific lineage or clade [eg, ]. Furthermore, there is a relative abundance of novel or lineage-specific transcripts (NTs)  and lineage-specific regulatory non-coding RNAs (ncRNAs) [6–8] in the transcriptomes of different eukaryotes. However, little is known about the role of divergent genes and lineage-specific transcripts in adaptive evolution. In ruminants, there is unequivocal evidence for lineage-specific and highly divergent genes expressed in the placenta and/or trophoblast e.g., genes encoding interferon-tau , the placental lactogens , the pregnancy associated glycoproteins  and the prolactin related proteins . All of these highly divergent proteins appear to play adaptive roles in the reproductive biology of ruminants.
Given the unique adaptations of the reproductive system of ruminants, our goal is to use comparative genomics to identify genes and ncRNAs that are responsible for these evolutionary changes. Toward that end, we recently developed a bioinformatics strategy to mine collections of expressed sequence tags (ESTs) for divergent homologs and novel transcripts . This strategy led to the discovery of the ULBP gene cluster and eight divergent homologs in cattle [12, 14]. Herein, we describe the application and extension of our approach for identifying and characterizing NTs expressed in mammalian tissues. Using this approach, 91 NTs were identified in a collection of cattle placenta ESTs and then verified by in silico extension with DNA sequences in the public domain databases, gene expression profiling, and alignment to whole genome sequence.
Identification and characterization of novel transcripts in cattle placenta
A collection of 12,614 5' ESTs from a cattle term placenta cDNA library was reduced to a working set of 373 putative NTs and divergent homologs using pairwise BLASTN searches against non-cetartiodactyl EST and genome databases (October 2005 freezes), followed by in silico extension and full-clone sequencing of cDNA inserts (Table 1). Analysis of the sequence-extended EST-containing clones using TBLASTX searches against human and mouse UniGene, and against ESTs from non-cetartiodactyl species, permitted the distinction of divergent (N = 75) from unknown (N = 298) transcripts (Table 1). Repetition of in silico extension of EST sequences followed by removal of homologs using subsequently updated databases (April 2006 freezes) removed an additional 134 ESTs, thus leaving 164 putative NTs. Using the 6.2× Btau_2.0 cattle genome assembly as a reference, 73 transcripts were found likely to represent priming from polyA tracts of genomic DNA. These artifacts were subsequently removed, bringing the final working set to 91 NTs that have an average length of 993 bp (Table 1; Additional file 1). The definition of a transcript as novel thus indicates that neither its nucleotide sequence nor the hypothetical proteins encoded by it had similarity to any non-cetartiodactyl DNA or protein sequence at the time the databases were searched. The operational term is not meant to imply an evolutionary mechanism, such as gene loss or rapid divergence.
Alignment to the cattle genome sequence assembly allowed polyA signals to be identified within 50 bp downstream from the end of the aligned NT. Polyadenylation signals were found in 86/91 (95%) of the NTs (Additional file 2). The consensus signal, AATAAA, was present in 54 (59%) of the NTs; 15 (16%) had the less conserved signal ATTAAA, and 17 (19%) had rare polyadenylation signals experimentally identified in human mRNA . A polyadenylation signal was not detected in 5 (5%) of the NTs, likely representing incomplete transcripts.
Analysis of the 91 NTs revealed 64 NTs with one or more open reading frames (ORFs) >33 codons (Figure 1). Among these 64 NTs there are 78 predicted ORFs >33 codons, all of which were considered as candidates for encoding novel proteins (Figure 1; Additional file 3). The cDNA clone with the longest ORF, BTC1_14RD, contains 235 codons. TBLASTN of the translated ORFs against all predicted cattle transcripts resulted in 24 unique hits, of which seven are to hypothetical proteins, 15 are to predicted proteins (having some supporting molecular evidence), and two have similarity to known cattle proteins (Additional file 2). While all ORFs >33 codons terminate in a stop codon, 19 (21%) do not possess an ATG start codon, suggesting that at least some of the NTs represent 3' regions of genes or have ORFs with alternate start codons. The ORFs possessing an ATG start codon were analyzed for the presence of the Kozak consensus sequence RMC-ATG-G, a signal for eukaryotic translation initiation , where R is a purine and M is [AC]. One ORF matched this consensus pattern and 24 ORFs matched the less restricted Kozak pattern R-N-N-ATG-R, where N is any nucleotide, and R is a purine.
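The two Kozak patterns described above can be expressed as regular expressions over the seven bases centred on the start codon. The following is a minimal sketch, not the authors' code; the function name `kozak_class` and the 0-based `atg_pos` argument are illustrative assumptions.

```python
import re

# IUPAC codes as used in the text: R = purine (A/G), M = A or C, N = any base.
STRICT_KOZAK = re.compile(r"[AG][AC]CATGG")          # RMC-ATG-G
RELAXED_KOZAK = re.compile(r"[AG][ACGT]{2}ATG[AG]")  # R-N-N-ATG-R


def kozak_class(transcript: str, atg_pos: int) -> str:
    """Classify the context of the ATG that starts at 0-based index atg_pos.

    Returns "strict", "relaxed" or "none". The name and interface are
    hypothetical; only the two patterns come from the text.
    """
    seq = transcript.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG codon")
    # The window needs 3 bases upstream of the ATG and 1 base downstream.
    if atg_pos < 3 or atg_pos + 4 > len(seq):
        return "none"
    window = seq[atg_pos - 3:atg_pos + 4]
    if STRICT_KOZAK.fullmatch(window):
        return "strict"
    if RELAXED_KOZAK.fullmatch(window):
        return "relaxed"
    return "none"
```

The strict pattern is tested first because every window that matches RMC-ATG-G also matches R-N-N-ATG-R.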
Protein motifs are predicted in the translated ORFs of four NTs (Additional file 2, Table 2). In addition, a transmembrane helix (TMH) was identified in one ORF (BTC1_403NG) predicted to contain a single-span TMH. Signal peptides are predicted in four ORFs with no concomitant TMH prediction (Additional file 2) thus indicating that these transcripts encode soluble/secreted proteins. Functional elements located in untranslated regions (UTRs) are predicted in 25 NTs (Additional file 2, Table 2). The NT BTC1_43PW contains a predicted selenocysteine insertion sequence (SECIS) element. These elements are required for translation of the UGA codon as selenocysteine in mRNAs of selenoproteins, several of which are species-specific and participate in peroxide degradation and antioxidant reactions . The UTR sequence region corresponding to the SECIS prediction was manually verified for the appropriate secondary structure (Additional file 4). Two sequences, BTC1_40PW and BTC1_14RD, are predicted to contain conserved ncRNA secondary structural elements within their 3' UTRs thus implicating them in regulatory functions.
The GC-content average of the NTs, mRNAs and genomic DNA sequence is 0.50 (range 0.33 to 0.7), 0.51, and 0.45, respectively (Additional file 2). Ten percent of NTs have low GC content average (<0.4) as compared to 12% of full length mRNAs and 31% of genomic scaffold sequence that contains the NTs. Candidate exonic CpG islands were identified in 5' ends of four NTs (Table 3, Additional file 2). Criteria for CpG islands were length >200 bp, GC content ≥ 53% and observed/expected CpG ratio ≥ 0.63 . Anchoring the NTs to the 6.2× draft cattle genome sequence allowed the identification of 26 additional NTs with CpG islands upstream of their start sites. Other NTs with regulatory sequence features included 12 with inverted repeats of at least 11 bp. The longest inverted repeats are 18 bp and separated by approximately 300 bp. In addition, 7 NTs contain G-quartets, which are tetrads of guanine/purine tetramers that are implicated in transcription pausing, mRNA stability, recombination hotspots, stability of chromosomes and interactions of telomeres [reviewed in ].
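The CpG island criteria stated above (length >200 bp, GC content ≥ 53%, observed/expected CpG ratio ≥ 0.63) can be checked for a candidate sequence as sketched below. The observed/expected formula used here, (number of CpG × length) / (number of C × number of G), is the commonly used definition and is an assumption; the text does not say which formula or tool was applied. Function names are illustrative.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: (#C * #G) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def is_cpg_island(seq: str, min_len: int = 200,
                  min_gc: float = 0.53, min_obs_exp: float = 0.63) -> bool:
    """Apply the three thresholds quoted in the text to one sequence."""
    if len(seq) <= min_len:  # the text requires length > 200 bp
        return False
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

This sketch scores one pre-selected region; scanning a genome for islands would additionally need a sliding window, which is not shown.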
Twenty-one NTs qualified as candidate ncRNAs (Table 3, Additional file 2). These NTs have no ORF >33 codons, and no predicted exon within 5 kb of flanking cattle genome sequence. Among these, 20 contain a known polyadenylation signal, of which seven are spliced, and 14 align to the genome sequence along the transcript's entire length with >95% identity (Additional file 2).
Genomic context of the NTs
The genomic organization of the NT genes was determined by BLASTN against the 6.2× cattle genome draft sequence. All 91 NTs have matches in the cattle genome (Additional file 2), 49 of which have ≥98% identity over >95% of their length and 48 have polyadenylation signals (intronless transcripts). The remaining 42 NTs have well-defined intron-exon boundaries indicative of mRNA splicing (an average of 3 exons), of which 40 possess a polyadenylation signal (Table 4; Additional file 2). CpG islands were located upstream of 16 of these 40 transcripts. Of the 91 NTs, 68 were mapped in silico to chromosome locations in the cattle genome on the basis of existing radiation hybrid (RH) and comparative mapping information  (Additional file 2). The remaining 23 were located on unmapped scaffolds. Among the 68 mapped NTs, 10 were located on BTA19 (P > 0.05). This analysis also revealed that 61 (67%) of the NTs are located within "intergenic" regions and 30 (33%) of the NTs are within introns of known genes (Table 3, Additional file 2). Three of the NTs on BTA19 (BTC1_390NG, BTC1_28PW, BTC1_8NG) were located within introns of the developmentally regulated genes HOXB3, JAF1 and ATP1a2a. In addition, 24 NTs (26%) are located either in subtelomeric regions or at the boundaries of homologous synteny blocks as defined by Everts van der Wind et al.  (Table 3; Additional file 2). The distribution of NTs was positively associated (P < 0.05) with the subtelomeric regions (< 2 Mbp from telomere) as compared to the distribution of randomly chosen RefSeq genes.
Among the 91 NTs, 71 have no NCBI annotation. The 20 annotated NTs that aligned to the cattle genome were analyzed in greater detail. Each of them was found to have a gene model and a cattle RefSeq identification number (Additional file 2). Most appear to be either alternatively spliced variants, antisense to the predicted genes, or long and divergent 3' ends of predicted genes. The 20 annotated NTs were examined further by comparing full-length mRNA sequence from the cattle RefSeq prediction to the human genome. Similarity between the full-length cattle RefSeq predictions and human genes was found only for 12 of the NT-containing RefSeq genes (BLASTN E value < 10^-10). Among these 12 cattle RefSeq genes with putative human orthologs, the NTs contained within them represent novel splice products, whereas three are novel antisense products. In silico comparative mapping confirmed homologous positions within the cattle and human genomes, thus providing additional supporting evidence that these 12 NTs are part of highly divergent genes and/or genes created de novo (Additional file 2).
It was also possible to predict locations where genes encoding the NTs should be in the human genome (Additional file 2; Figures 2, 3, 4). This was accomplished by identifying the cattle genome sequence flanking the NTs with significant nucleotide similarity in the human genome (although the NTs themselves did not match the human genome using a BLASTN and TBLASTX E-value threshold of 10^-10). Anchoring the NTs to the human genome using conserved flanking sequences revealed that 69% have an assumptive location in intergenic regions or within an intron of a known human gene. The genomic context for all 91 NTs is given in Additional file 2. A detailed description of the genomic context of three NTs is presented below. The number that can be presented is limited by available space (all alignments can be found in Additional file 5).
BTC1_14RD and BTC1_130FL: alternatively spliced mRNAs of a novel, paralogous gene residing amongst known transcription factors
BTC1_14RD (GenBank:XM_611254) and BTC1_130FL have 93% and 92% nucleotide identity, respectively, to artiodactyla-specific transcript 1 (Ast1) identified by Kim and coworkers. Neither gene has an ortholog in the human or mouse genome. The three transcripts, Ast1, BTC1_14RD and BTC1_130FL, were aligned to BAC AC146804 (Figure 2; Additional file 6). Both BTC1_14RD and BTC1_130FL mapped at position BTA18:55591560–55600453, 5.4 kb from Ast1 (Figure 2). Genscan also predicts a cattle gene in this region, which is supported by alignment of numerous cattle ESTs and a CpG island flanking the transcription start site (Figure 2). From the alignments it is apparent that BTC1_130FL is an alternatively spliced form of BTC1_14RD and that both transcripts represent a gene that is paralogous (>90% similar) to Ast1 (Additional file 6). The assumptive human genome context of BTC1_14RD and BTC1_130FL was then investigated by anchoring conserved sequences flanking the NTs in the cattle BAC AC146804 (on BTA18) to the human genome. Nucleotide similarity was used to anchor BTC1_14RD, BTC1_130FL and Ast1 mRNAs to HSA19q, between the human genes ZNF71 and ZIM2 (Figure 2). This region of the human genome is rich in retrotransposed sequences, imprinted genes (PEG3 and ZIM2) and genes encoding zinc-finger proteins, thus indicating a high level of evolutionary and biological activity.
BTC1_146JE: a putative non coding RNA
BTC1_146JE is a 2206 bp transcript that contains an inverted repeat, and is found only in cattle and bottlenose dolphin (cetartiodactyl-specific). There is no ORF and no BLAST hit to any non-cetartiodactyl exon within 5 kb of flanking genomic sequence 5' and 3' of the NT gene on BTA21 (Figure 3; Table 3). A large number of unannotated, spliced cattle ESTs align to contig54150 in the same position, further supporting its characterization as a novel spliced transcript.
The genomic DNA sequence flanking BTC1_146JE in contig54150 permitted comparative anchoring to HSA14q32.31 in an intergenic, non-conserved region containing a cluster of small nucleolar RNAs (snoRNAs; Figure 3). These snoRNAs are encoded in the introns of the non-coding maternally expressed gene MEG8. MEG8 is currently not included in the UCSC database of known genes and thus not shown in Figure 3. To test the possibility that BTC1_146JE represents the pre-processed RNA for a snoRNA, a BLASTN search was carried out using the cluster of snoRNAs as query sequences (word size 7 and E-value threshold of 0.01). The 71 bp snoRNAs 14q(II-8) and 14q(II-9) aligned with BTC1_146JE with short matches (23/26 and 27/30 identities, respectively). This suggests that BTC1_146JE represents a precursor RNA from which a cetartiodactyl-specific snoRNA (or miRNA) is processed.
BTC1_113FL: a novel transcript expressed preferentially in the thalamus
BTC1_113FL (GenBank:XM_611248) is a 659 bp transcript found only in cattle. It encodes a hypothetical protein 69 aa in length and whose gene is located on BTA2 (Table 3). Alignment of BTC1_113FL to cattle contig74653 reveals a gene with five exons (Figure 4). The gene is located within 1.5 Kbp of the 3' end of and in opposite orientation to SECP43. The presence of a gene in this location is strongly supported by a large number of spliced ESTs containing at least one GT/AG splice site, a Genscan prediction of 2 exons, a CpG island that spans the first exon and the probable transcription start site (Figure 4), and detection of expression in multiple tissues. DNA sequence flanking the gene corresponding to BTC1_113FL on contig74653 anchors it to HSA1p35.3, consistent with the available comparative mapping data . Visual inspection indicates that BTC1_113FL should fall in a region of the human genome occupied by MGC45806 (GenBank:NM_152304), a RAB GTPase oncogene involved in vesicle-trafficking; however, BLASTN, TBLASTX and TBLASTN searches of BTC1_113FL against MGC45806 sequence resulted in no significant matches, demonstrating a lack of detectable homology between the exons of the two transcripts. BLAT  alignment of this NT against the human genome also reveals no alignment anywhere within the chromosome cytogenetic band where the NT is anchored by its flanking sequence. These data provide evidence that BTC1_113FL is a lineage-specific novel transcript.
Expression and tissue distribution of NTs
Expression levels of the NTs were analyzed in 17 tissues from a one-week-old Jersey calf and a term placentome. Expression of 86/91 NTs could be analyzed (5 had no representative cDNA element on the microarray). The expressed NTs were categorized with respect to the presence or absence of ORF(s) as well as with expression levels classified as high, moderate or low on the basis of an arbitrary scale (Table 5). Among the 60 expressed NTs with ORFs, 55% were expressed at a low level in all 18 tissues, and 45% were expressed at moderate or high levels in one or more tissues (Table 5). A similar distribution of expression levels in tissues was found among the NTs without ORFs. Tissue-preference in expression patterns of NTs was analyzed further by identifying those NTs whose expression in any one tissue was more than two-fold higher than in at least 13 of the 17 other tissues (Figure 5; Additional file 2). A total of 39 NTs show tissue preference in their expression pattern. Of these, 28 were preferentially expressed in a single tissue. Six NTs were preferentially expressed in placentome, of which two were predicted to be ncRNAs. Ten different tissues showed exclusive expression of one or more of the NTs, with placentome and thymus having the largest number.
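The tissue-preference rule (expression in one tissue more than two-fold higher than in at least 13 of the 17 other tissues) can be sketched as follows. This is an illustration under stated assumptions: expression values are taken to be positive, linear-scale (not log-transformed) intensities, and the function name and dictionary layout are hypothetical rather than taken from the study.

```python
from typing import Dict, List


def preferred_tissues(expr: Dict[str, float], fold: float = 2.0,
                      min_others: int = 13) -> List[str]:
    """Return tissues in which a transcript shows preferential expression.

    expr maps tissue name -> linear-scale expression level (assumed layout).
    A tissue qualifies when its level is more than `fold` times the level
    in at least `min_others` of the remaining tissues.
    """
    hits = []
    for tissue, level in expr.items():
        others = [v for t, v in expr.items() if t != tissue]
        n_exceeded = sum(1 for v in others if level > fold * v)
        if n_exceeded >= min_others:
            hits.append(tissue)
    return hits
```

Under this rule a transcript can qualify in more than one tissue, which is consistent with the text reporting that only 28 of the 39 tissue-preferential NTs were preferentially expressed in a single tissue.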
Comparative genomics, bioinformatics and microarray analysis were used to identify 91 transcripts encoded in the cattle genome, but not encoded in the genomes of non-ferungulate mammals (see further discussion below concerning the dog genome). The yield of NTs from the original EST collection is 91/12,614 = 0.7%, suggesting that NTs are relatively rare. Among the 91 NTs, 78 ORFs were identified, of which 48 are <100 codons and 30 are >100 codons (Figure 1; Additional file 3). The latter have a high probability of coding for a protein. The lack of Pfam HMM matches for 99% of NT ORFs in InterProScan searches is strong evidence for the absence of homology to known proteins. Anchoring the NTs to the cattle genome allowed identification of 30 transcripts having CpG islands upstream of their start sites (Additional file 2), providing additional support for their classification as protein-coding genes or ncRNAs. Specific protein functional motifs were identified in 10 predicted proteins encoded by the NTs, and 29 have functional non-coding motifs (Table 2; Additional file 2). All but five of the NTs showed evidence of active transcription in one or more tissues, and six were found to be preferentially expressed in cattle placentome, which is the source tissue of the EST collection used to mine for NTs. These results collectively provide the first conclusive evidence for an abundance of lineage- and tissue-specific transcripts encoded in the cattle genome.
The 21 NTs identified as high-probability ncRNAs provide a useful set of probes for exploring gene regulation in placental development and function [5, 6]. Five of the putative ncRNAs are preferentially expressed in tissues that comprise the brain-immune-endocrine axis; placentome, thymus, thalamus, cerebrum or cerebellum. Seven contain inverted repeats that may be involved in internal base-pairing and gene regulation . These putative ncRNAs may thus represent spliced, single-exon, primary snoRNA or microRNA transcripts . Although microRNAs (miRNAs), snoRNAs and other ncRNAs that are not polyadenylated, may have been filtered out using our methods, on the basis of our findings, it is clear that placenta is a rich source of ncRNAs. Further studies are needed to clarify their functions in placental physiology.
The availability of a draft of the cattle genome sequence provided an opportunity to study the comparative genomic organization of the NTs, to confirm their authenticity, and to distill evidence for their origin, evolution and function. For the examples presented, as well as others in the dataset, the genes encoding the NTs are flanked by genes that are conserved in the human genome. This allowed us to identify a presumptive human genome context for the NTs. The recent availability of a 7.6× draft of the dog genome sequence allowed us to ask ex post facto whether the NTs are present in the dog genome. It was interesting to find that only five of the NTs (BTC_55FL, BTC1_102FL, BTC1_390NG, BTC1_21PW, and BTC1_40RD) matched sequences in the dog genome, suggesting that the sequences encoding these NTs were present in a ferungulate ancestor. The most parsimonious explanation for the presence of NTs in the cattle genome but not in other non-ferungulate mammalian genomes is that the NT genes were deleted from a common ancestor of primates and rodents after its divergence from the ferungulates (assuming that these five NTs will be found in other ferungulate genomes as well). Thus, in total, 65 NTs were identified that are (to date) only found in cattle, and 21 were identified in cattle and other cetartiodactyla (Table 4). Among the ruminants, this could represent the de novo formation of genes by overprinting , more recent gene deletion, extreme divergence, or the lack of complete genome sequence information for the other species. Other proposed mechanisms for the appearance of lineage-specific genes, such as retrotransposition [reviewed in ], are not excluded by our analysis. Further study of the phylogenetic distribution of the NT genes will provide a better understanding of their origin, and the timing of gene loss/gain in ancestral species.
The cetartiodactyl NTs BTC1_14RD and BTC1_130FL represent interesting examples of transcripts that are encoded within a highly dynamic genomic context (Figure 2). BTC1_14RD/BTC1_130FL and Ast1 (artiodactyl-specific transcript1) are located in a region of the cattle genome that is surrounded by genes encoding zinc-finger proteins. Ast1 and its neighboring genes, PEG3 and ZIM2, were previously shown to undergo lineage-specific imprinting, and PEG3 and ZIM2 are thought to have undergone rearrangements independently in different lineages . Roughly 80 Kbp of the cattle genome between PEG3 and ZIM2 (BTA18:55,580 Kbp – 55,655 Kbp) spanning BTC1_14RD and Ast1 is absent in the human genome, and there are a large number of segmental duplications in this region. The entire locus maps ~1 Mbp from the telomere of BTA18. It is known that telomeric and subtelomeric regions are highly active in segmental duplications and the formation of novel genes . Thus, the artiodactyl-specific genes for Ast1 and BTC1_14RD appear to have been created by a segmental duplication. Analysis of other ferungulate genomes will shed greater light on the origin and evolution of these interesting genes. Given their genomic context and imprinting status these genes may play an adaptive role in placental function.
The significant number of novel transcripts expressed preferentially in cattle placenta and in other tissues raises the question of their role in adaptive evolution. Are the genes encoding these transcripts lost in other genomes because they are dispensable or do they have adaptive evolutionary significance? Is gene gain by insertion, segmental duplication and/or chromosome duplication a driving mechanism of adaptive evolution? Evolutionary theory has long held that adaptive phenotypic change is realized through changes in developmental processes . The genes that control these developmental processes are known to be highly conserved  and it is believed that adaptive evolutionary change is fueled by mutations that modify the expression of these conserved regulatory loci [36, 37]. We and others have proposed that rapidly evolving genes, "novel" or lineage-specific genes, and ncRNAs play a role in mediating changes in gene expression that affect regulatory genes controlling fundamental developmental processes [13, 38–40]. Lineage-specific genes can appear as a result of non-homologous recombination of exons between different genes [32, 41–43], or from fast-evolving duplicated genes that have lost significant sequence similarity even within relatively short evolutionary time-spans [44, 45]. Keese and Gibbs  provided numerous examples of genes that are created de novo by translation of previously unused reading frames of existing coding and non-coding genomic DNA, a mechanism that has been called "overprinting" [46, 47]. Each of the NTs has its own evolutionary history, and understanding their origins will require sequence information from additional mammalian genomes. Whether they are rapidly evolving and dispensable  and/or have adaptive functions can only be determined by experimentation .
The 91 lineage-specific transcripts discovered in the present study are a new resource for studies of adaptive changes in placental architecture and function. The tissue distribution of the NTs suggests that many of them also have adaptive roles in other tissues. The presence of so many lineage-specific genes in cattle and their association with subtelomeric regions, which are hotspots for chromosome rearrangements and recombination, suggests that gene loss and gain are important mechanisms of genome evolution in mammals. The sequencing of additional vertebrate genomes with defined phylogenetic relationships will enable the search for lineage-specific genes to take on a more evolutionary perspective that is required to understand their origins and functions.
Identification and characterization of novel transcripts expressed in cattle placenta
A collection of 12,614 5' ESTs from a normalized and subtracted cattle placenta cDNA library was selected as an EST resource for novel transcript discovery. The repeat-masked ESTs were analyzed for similarity to non-cetartiodactyl ESTs and genomic sequences using BLASTN at a threshold E (expectation value) < 10^-5, and NCBI dbEST and non-redundant DNA sequence databases from which cetartiodactyl sequences were removed. All ESTs that matched non-cetartiodactyl sequences at E < 10^-5 were removed from the starting set (Table 1). Another BLASTN search against human draft sequences at E < 10^-10 removed additional ESTs from the set. The entire process of searching, parsing, and subtraction was carried out iteratively with a set of pipelined Perl scripts (PipeBLASTN).
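The parse-and-subtract step can be sketched as below. PipeBLASTN itself was a set of Perl scripts that is not reproduced here; this Python illustration assumes BLAST results in the standard 12-column tab-delimited format (query ID in column 1, E-value in column 11), and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def queries_with_hits(tabular_lines: Iterable[str],
                      max_evalue: float = 1e-5) -> Set[str]:
    """Collect query IDs that have at least one hit with E < max_evalue.

    Assumes the 12-column tabular BLAST layout; this is an assumption about
    the input, not a description of the original pipeline.
    """
    hits = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            hits.add(query_id)
    return hits


def subtract_matched(est_ids: Iterable[str], tabular_lines: Iterable[str],
                     max_evalue: float = 1e-5) -> List[str]:
    """Return the ESTs that remain after removing those with significant hits."""
    matched = queries_with_hits(tabular_lines, max_evalue)
    return [est for est in est_ids if est not in matched]
```

Running the same function a second time with `max_evalue=1e-10` on results from a different database would mirror the iterative subtraction described above.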
PipeBLASTN was followed by in silico extension of the remaining placenta ESTs using public domain cattle ESTs. The ESTs were extended using a custom clustering algorithm and the CAP3  assembly program. The clusters were created using stringent parameters (minimum overlap length of 40 bp and 95% minimum sequence identity). CAP3 was run with default parameters, on each EST cluster. The in silico extended ESTs were run through PipeBLASTN again to remove any non-cetartiodactyl homologs. The in-house source clones of the remaining 5' extended ESTs were sequenced in the 3' direction using an anchored oligodT primer. The 3' ESTs were vector- and quality-trimmed, repeat-masked, and assembled to their 5' extended counterparts to obtain full-length clone sequences. Those 5'-3' pairs that did not overlap were subjected to additional rounds of primer-walking until complete sequences were obtained.
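The clustering thresholds quoted above (minimum overlap of 40 bp at 95% minimum identity) can be illustrated with a simple suffix-prefix overlap test between two ESTs. This is a deliberately simplified sketch: it compares ungapped overlaps only and is neither the authors' custom clustering algorithm nor CAP3, both of which handle gaps and sequence quality. The function name is hypothetical.

```python
def best_overlap(a: str, b: str, min_overlap: int = 40,
                 min_identity: float = 0.95) -> int:
    """Length of the longest ungapped overlap between the 3' end of `a`
    and the 5' end of `b` that meets both thresholds, or 0 if none does.
    """
    a, b = a.upper(), b.upper()
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink toward min_overlap.
    for k in range(max_len, min_overlap - 1, -1):
        suffix, prefix = a[-k:], b[:k]
        matches = sum(1 for x, y in zip(suffix, prefix) if x == y)
        if matches / k >= min_identity:
            return k
    return 0
```

Two ESTs would be placed in the same cluster when `best_overlap` is non-zero in either order; reverse-complement orientation is not considered in this sketch.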
To extend the newly assembled sequences further, the process of clustering and assembly with public domain cattle ESTs followed by PipeBLASTN was carried out again to yield a set of putative NTs and divergent orthologs. The assembled sequences were manually edited and proofed for spurious assemblies, un-called bases, and reverse-complementarity. The divergent homologs were separated from the putative NTs by interrogating the sequences for distant homology in other species using BLASTX against non-redundant proteins (NCBI, April 2006), and TBLASTX against human and mouse UniGene (Build 190 and 152, respectively) and ESTs from other species (NCBI, April 2006). Sequences that aligned to database sequences below an empirically chosen E-value threshold of 10^-10 were designated as putative divergent homologs, and those that scored above this threshold were designated NTs. The NTs with similarity to cetartiodactyl-specific genes were identified by BLAST against a database of cetartiodactyl-specific non-redundant sequences and ESTs. Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank data libraries and the accession numbers are listed in Additional file 2.
The second assembly (March 2005, Btau_2.0; BCM-HGSC) of the 6.2× cow genome draft sequence  was used to determine how many NT source clones were primed from an internal polyA tract during their sequencing from the 3' direction with an anchored oligodT primer. To do this, each NT contig was manually checked to ensure the terminal positioning of the 3' EST within the contig. The NTs were then aligned to the scaffold sequence. The aligned region of the scaffold and flanking 100 bp was extracted for each of the aligned NTs. If the 3' EST of a NT was primed by an internal polyA tract, it would be visible in the scaffold downstream from the 3' end of the aligned NT. These NTs were removed from the set. In addition, trimming of low-quality sequences from 3' ESTs may have resulted in complete or partial removal of the polyadenylation signal. Each of the corresponding scaffold sequences was scanned for the presence of a polyadenylation signal within 50 bp downstream from the end of an aligned NT.
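Both genome-based checks described above can be sketched in a few lines: looking for an A-rich genomic tract downstream of the alignment end (evidence of internal priming) and scanning the 50 bp downstream for a polyadenylation signal. The sketch assumes the scaffold sequence is supplied in the same orientation as the transcript and that `nt_end` is the 0-based scaffold index just past the aligned NT. The minimum A-run length of 12 is an illustrative assumption (the text gives no threshold), and only the two most common signals are listed, not the rarer variants.

```python
import re
from typing import Optional, Sequence, Tuple

COMMON_SIGNALS = ("AATAAA", "ATTAAA")  # rarer variants are not included here


def find_polya_signal(scaffold: str, nt_end: int, window: int = 50,
                      signals: Sequence[str] = COMMON_SIGNALS
                      ) -> Optional[Tuple[str, int]]:
    """Return (signal, offset) for the signal closest to the alignment end
    within `window` bases downstream, or None if no signal is found."""
    region = scaffold[nt_end:nt_end + window].upper()
    best = None
    for signal in signals:
        pos = region.find(signal)
        if pos != -1 and (best is None or pos < best[1]):
            best = (signal, pos)
    return best


def internally_primed(scaffold: str, nt_end: int, flank: int = 100,
                      min_run: int = 12) -> bool:
    """Flag a genomic run of A's in the `flank` bases after the alignment end.

    min_run is a hypothetical threshold chosen for illustration only.
    """
    region = scaffold[nt_end:nt_end + flank].upper()
    return re.search("A{%d,}" % min_run, region) is not None
```

A transcript for which `internally_primed` returns true would be treated as a priming artifact and removed, while `find_polya_signal` supplies the signal type for the remaining NTs.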
Identification and characterization of ORFs
High-probability ORFs were identified in the NTs using NCBI's ORFfinder. The locations of start and stop signals were determined in all three reading frames for NTs with a known polyadenylation signal, and in all six reading frames for NTs with no identifiable polyadenylation signal. The most probable complete or longest partial ORFs were selected using a threshold length of 100 bp. The manual analysis of all six reading frames involved choosing the longest ORF that either contained more than one downstream stop codon, or one or more juxtaposed downstream start and stop signals within 20 codons of each other. The selected ORF contained at most two upstream stop signals, and one upstream ORF < 30 codons in length. The use of a six-frame analysis to select ORFs that were clearly demarcated by strong start and stop signals helped to reduce the probability of picking false ORFs even though we used 33 codons (approximately 100 bp) as a threshold. The complete ORFs that had a start and a stop codon were analyzed for the presence of the Kozak consensus, the eukaryotic translation start site context. False positives in ORFs were predicted by randomizing the NTs and then predicting ORFs in these sequences using the same rules for ORF prediction. Because ORFs between 33 and 100 codons are predicted to have a ~50% false positive rate, all ORFs >33 codons were considered as candidates for encoding small novel proteins. The sequences of the NTs and their predicted proteins can be found in Additional files 1 and 3.
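The frame-wise search for start and stop signals can be sketched as follows. This is a simplified illustration, not ORFfinder itself and not the manual selection rules described above: it only reports ATG-initiated ORFs of at least `min_codons` codons (an assumed reading of the 33-codon threshold), in three or six frames.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int = 33):
    """Yield (start, end, complete) for ATG-initiated ORFs in one forward frame.

    `end` is exclusive and includes the stop codon when the ORF is complete.
    """
    seq = seq.upper()
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3, True)
            start = None
    if start is not None:
        # Partial ORF that runs off the 3' end without a stop codon.
        last = start + ((len(seq) - start) // 3) * 3
        if (last - start) // 3 >= min_codons:
            yield (start, last, False)


def find_orfs(seq: str, six_frames: bool = False, min_codons: int = 33):
    """Return ORFs as (strand, frame, start, end, complete), longest first."""
    strands = [("+", seq.upper())]
    if six_frames:
        strands.append(("-", reverse_complement(seq)))
    hits = []
    for strand, s in strands:
        for frame in range(3):
            for start, end, complete in orfs_in_frame(s, frame, min_codons):
                hits.append((strand, frame, start, end, complete))
    return sorted(hits, key=lambda h: h[3] - h[2], reverse=True)
```

Coordinates on the "-" strand refer to the reverse-complemented sequence. Setting `six_frames=True` corresponds to the case of an NT with no identifiable polyadenylation signal.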
The ORFs were analyzed for protein motifs and domains using InterProScan. Only those motifs with a precision rate >93% were reported. Additionally, ORFs were searched for transmembrane helices using TMHMMv2.0 and SVMtm, and for signal peptides using SignalP with both neural networks and hidden Markov models. Conflicting predictions of a signal peptide and a transmembrane region in the same sequence were resolved as follows: i) if the prediction was within 15 codons from the 5' end then the prediction was counted as a signal peptide; ii) if the signal peptide was predicted anywhere else, Phobius was used to confirm either prediction.
The ORFs were searched for novel repeated protein domains using the methodology and software described by Yeats et al. Prospero was used to self-search the ORFs for occurrence of internal duplications. Alignments with a length of less than 30 residues (threshold length of a protein domain) were removed. To search for repeated domains, alignments in which the start points of each sub-sequence were separated by less than 45 residues were discarded. The alignments generated by Prospero were used as an initial alignment to make profile hidden Markov models using HMMER, and the resulting alignments were scanned against the Swiss-Prot database.
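The two filters applied to the self-alignments can be written compactly. The sketch below assumes each Prospero alignment has already been parsed into the start positions of its two sub-sequences and its length; the `SelfAlignment` record and the function name are hypothetical and do not reflect Prospero's output format.

```python
from typing import List, NamedTuple


class SelfAlignment(NamedTuple):
    start_a: int   # start of the first sub-sequence (residue index)
    start_b: int   # start of the second sub-sequence (residue index)
    length: int    # alignment length in residues


def keep_repeat_candidates(alignments: List[SelfAlignment],
                           min_length: int = 30,
                           min_separation: int = 45) -> List[SelfAlignment]:
    """Drop alignments shorter than 30 residues or with start points < 45 residues apart."""
    return [
        a for a in alignments
        if a.length >= min_length and abs(a.start_b - a.start_a) >= min_separation
    ]
```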
Characterization of non-coding regions
The NT sequences were scanned for functional UTR elements using UTRScan. To identify UTR elements that may occur in NTs by chance, the NTs were randomized using the program shuffle in SQUID software, and searched for motifs using UTRScan. This was repeated 10 times. The observed frequency of a motif was compared to that found in the randomized NT sequences. The motif was reported if its observed frequency was at least 4 times greater in the NTs than in the randomized NT sequences. The GC-content of each sequence was estimated with the program geecee from the EMBOSS suite. Candidate CpG islands in 5' exons and non-coding regions were identified by aligning NTs to the cattle genome. The aligned scaffold region and 200 bp upstream from the start of the alignment were extracted. These sequences were searched with the command-line version of CpG Island Searcher using a minimum CpG island length of 200 bp, minimum GC-content of 53%, and observed/expected CpG ratio of 0.63. The choice of parameter values was based on the averages found by Takai and Jones for exonic CpG islands. Additionally, each NT sequence was searched against itself to identify inverted repeats (IRs), which were confirmed independently using the einverted program.
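The three CpG island criteria can be made concrete with a short sketch. It evaluates a single candidate region only and does not reproduce the window-sliding search of CpG Island Searcher; the observed/expected ratio uses the standard definition (number of CpG × N) / (number of C × number of G), which is assumed here to match the tool's calculation.

```python
def cpg_stats(seq: str):
    """Return (GC content in percent, observed/expected CpG ratio) for one region."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp


def passes_cpg_criteria(seq: str, min_len: int = 200,
                        min_gc: float = 53.0, min_obs_exp: float = 0.63) -> bool:
    """Apply the length, GC-content and observed/expected thresholds from the text."""
    gc_percent, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_percent >= min_gc and obs_exp >= min_obs_exp
```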
The NTs were analyzed for guanine-rich sequence motifs capable of forming three-dimensional structures called G-quadruplexes or G-quartets. On the basis of the algorithm by D'Antonio and Bagga, a program was written to identify the G-quartet motif in the NTs. The results were compared with results obtained using shuffled sequences as a control. QRNA was used to search for conserved ncRNA structures amongst the NTs. The input to QRNA was a file containing the search results from a BLASTN of NTs against the non-redundant DNA sequence database. Default parameters were used, and the results were compared with another run of QRNA using shuffled data as a control.
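A motif scan with a shuffled control might look like the sketch below. The pattern used, four runs of three or more guanines separated by loops of 1 to 7 nucleotides, is a commonly used quadruplex definition and is an assumption here; it is not necessarily the exact rule of the D'Antonio and Bagga algorithm or of the program written for this study. The number of shuffles and the seed are likewise illustrative.

```python
import random
import re

# Assumed motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def count_g4(seq: str) -> int:
    """Count non-overlapping matches of the assumed G-quadruplex motif."""
    return len(G4_PATTERN.findall(seq.upper()))


def shuffled_control(seq: str, n_shuffles: int = 10, seed: int = 0) -> float:
    """Mean motif count over mononucleotide-shuffled copies of the sequence."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_shuffles):
        bases = list(seq.upper())
        rng.shuffle(bases)
        counts.append(count_g4("".join(bases)))
    return sum(counts) / n_shuffles
```

Comparing `count_g4(seq)` with `shuffled_control(seq)` gives a rough indication of whether the motif occurs more often than expected from base composition alone.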
Microarray expression analysis
The gene expression data for the NTs were extracted from a larger microarray data set obtained by profiling total RNA from 17 different tissues collected by vivisection of a one-week-old Jersey calf (NCBI Gene Expression Omnibus series entry GSE3029) and term placentome. Expression patterns were obtained using a microarray containing approximately 7000 cattle placenta cDNAs spotted in duplicate on glass slides. The expression profiles for each tissue were obtained by comparing expression of each gene to a reference standard composed of RNAs from bovine brain and three different cell lines. In order for a gene to be included in the analysis, its fluorescence intensity had to be greater than three standard deviations (SD) above the background in either the sample or reference standard. All but five of the NTs were represented on this array by a cDNA clone.
The NTs were further categorized based on their fluorescence intensity values. Expression was considered as "high" if the fluorescence intensity was greater than 5000 +3SD above background in at least one tissue, "moderate" if fluorescence intensity was between 1000 and 5000 in at least one tissue, and "low" if the fluorescence intensity was between 50 and 1000 +3SD above background in all 18 tissues.
To determine the number of transcripts that were preferentially expressed in the eighteen tissues, the ratio of the normalized intensity ratios (ratio-of-ratios) was calculated for every pair of tissues using a Perl script. For a transcript to be considered as preferentially expressed in a given tissue t, the following relationship had to hold true in at least 13 out of 17 other tissues:
R(t)/R(any other tissue) > 2,
where R is the normalized intensity ratio (tissue:reference).
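The ratio-of-ratios rule can be restated as a short sketch. The original analysis used a Perl script; the Python function below is an independent illustration, with a hypothetical name and input layout (one dictionary of normalized tissue:reference ratios per transcript). Skipping non-positive ratios is an added assumption to avoid division by zero.

```python
from typing import Dict, List


def preferred_tissues(ratios: Dict[str, float],
                      fold: float = 2.0,
                      min_tissues: int = 13) -> List[str]:
    """Return tissues t for which R(t)/R(u) > fold in at least min_tissues other tissues u."""
    preferred = []
    for tissue, r_t in ratios.items():
        wins = sum(
            1
            for other, r_u in ratios.items()
            if other != tissue and r_u > 0 and r_t / r_u > fold
        )
        if wins >= min_tissues:
            preferred.append(tissue)
    return preferred
```

With 18 tissues profiled, each tissue is compared against the 17 others, so the default `min_tissues=13` matches the criterion stated above.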
Analysis of the cattle and assumptive human genome context of NTs
The number of NTs with matches in the cattle genome and those showing evidence of splicing were determined using a BLASTN search against all cattle genome sequences, including the draft assembly (March 2005, Btau_2.0, BCM-HGSC; and NCBI BACs). The NTs having evidence of "split matches" were analyzed using est2genome in order to determine exon-intron boundaries. An NT was classified as spliced if, after alignment to genomic sequence, the intron-exon boundary had a GT/AG splice site.
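The GT/AG criterion amounts to checking the first and last two bases of each inferred intron. The sketch below assumes intron coordinates have already been parsed from the est2genome output as 0-based, end-exclusive positions on the transcribed strand; the function names are hypothetical, and introns on the opposite strand would need to be reverse-complemented first.

```python
from typing import Iterable, Tuple


def is_canonical_intron(genomic: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron begins with GT and ends with AG (0-based, end-exclusive)."""
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def is_spliced(genomic: str, introns: Iterable[Tuple[int, int]]) -> bool:
    """True if at least one inferred intron has a canonical GT/AG boundary."""
    return any(is_canonical_intron(genomic, start, end) for start, end in introns)
```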
To identify candidate ncRNAs amongst the NTs that lacked an ORF, the NTs were aligned to the cattle genome assembly. Whenever possible, the aligned region and 5 kb of flanking sequences were extracted from each of the aligned scaffolds. The extracted region was searched for the presence of exons by doing a BLASTX search against Swiss-Prot and non-redundant protein databases using an E-value threshold of 10⁻³. All NTs that lacked an ORF and aligned to sequence scaffolds with no BLASTX hits were characterized as candidate ncRNAs.
The NTs were mapped in silico to the cattle genome using a BLASTN search of the NTs against the cattle genome assembly and repeat-masked cattle genome scaffolds (March 2005, Btau_2.0, BCM-HGSC). The scaffolds containing the NTs were searched for similarity to the human genome (NCBI Build 35; E < 10⁻⁵). The chromosome number, start and end positions, and orientation were parsed for the top human BLASTN hit to the cattle assembly sequences (Additional file 5). Custom tracks were generated using these data, and loaded onto the UCSC Genome Browser [66, 28]. Using the cattle-human comparative map also available on the UCSC Human Genome Browser, the position of the NTs on cattle chromosomes could be accurately placed. The cattle-human comparative map was also used to identify NTs that were located within 1 Mbp of evolutionary breakpoints, and within 2 Mbp of telomeres. The distribution of the 91 NTs was compared to the distribution of 91 random RefSeq genes by simulation as performed by Murphy et al. A chi-square test was performed to determine if NTs are associated with subtelomeric regions.
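As an illustration of the final step, a Pearson chi-square test on a 2 × 2 table (for example, NTs versus random RefSeq genes, inside versus outside subtelomeric regions) can be computed as sketched below. The table layout is an assumption; the text does not specify how the counts were arranged or whether a continuity correction was applied, and none is used here.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper-tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

In the assumed layout, `a` and `b` would hold the numbers of NTs inside and outside subtelomeric regions, and `c` and `d` the corresponding numbers for the random gene set.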
In order to estimate where the genes encoding the NTs would be if they were present in the human genome, conserved sequences in cattle contigs flanking each NT were used to anchor them in the human genome. The orientation and distance of the NT in the cattle assembly was used to juxtapose the NT in the human genome sequence. The annotation of each NT in the cattle and human genomes can be found in Additional file 2.
Note added in proof
The authors recognize that the annotation of the 91 NTs may change with updated versions of RefSeq and other genome databases. A table containing updates to the annotation of these sequences will be available at the Laboratory of Mammalian Genome Biology website.
References
Schneider H: Placental transport function. Reprod Fertil Dev. 1991, 3: 345-353.
Mossman HW: Vertebrate Fetal Membranes. 1987, Houndmills: MacMillan, 288-290.
Crespi B, Semeniuk C: Parent-offspring conflict in the evolution of vertebrate reproductive mode. Am Nat. 2004, 163: 635-53.
Clark NL, Swanson WJ: Pervasive adaptive evolution in primate seminal proteins. PLoS Genet. 2005, 1: 0335-0342.
Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, FANTOM Consortium; RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group): The transcriptional landscape of the mammalian genome. Science. 2005, 309: 1559-1563.
Mattick JS: Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays. 2003, 25: 930-939.
Martens JA, Laprade L, Winston F: Intergenic transcription is required to repress the Saccharomyces cerevisiae SER3 gene. Nature. 2004, 429: 571-574.
Tupy JL, Bailey AM, Dailey G, Evans-Holm M, Siebel CW, Misra S, Celniker SE, Rubin GM: Identification of putative noncoding polyadenylated transcripts in Drosophila melanogaster. Proc Natl Acad Sci USA. 2005, 102: 5495-5500.
Roberts RM, Ealy AD, Alexenko AP, Han C-S, Ezashi T: Trophoblast interferons. Placenta. 1999, 20: 259-264.
Byatt JC, Shimomura K, Duello TM, Bremel RD: Isolation and characterization of multiple forms of bovine placental lactogen from secretory granules of the fetal cotyledon. Endocrinology. 1986, 119: 1343-1350.
Xie S, Green J, Bixby JB, Szafranska B, DeMartini JC, Hecht S, Roberts RM: The diversity and evolutionary relationships of the pregnancy-associated glycoproteins, an aspartic proteinase subfamily consisting of many trophoblast-expressed genes. Proc Natl Acad Sci USA. 1997, 94: 12809-12816.
Larson JH, Kumar CG, Everts RE, Green C, Everts-van der Wind A, Band MR, Lewin HA: Discovery of eight novel divergent homologs expressed in cattle placenta. Physiol Genomics. 2006, 25: 405-413.
Lewin HA, Larson JH, Kumar CG: From Comparative mammalian genomics and adaptive evolution: divergent homologs and novel genes in the cattle genome. Evolutionary Theory and Processes: Modern Horizons, Papers in Honour of Eviatar Nevo. Edited by: Wasser SP. 2004, The Netherlands: Kluwer Academic Publishers, 139-152.
Larson JH, Marron BM, Beever JE, Roe BA, Lewin HA: Genomic organization and evolution of the ULBP genes in cattle. BMC Genomics. 2006, 7: 227-241.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2001, 29: 11-16.
Baylor College of Medicine. [http://www.hgsc.bcm.tmc.edu/projects/bovine]
Beaudoing E, Freier S, Wyatt JR, Claverie J-M, Gautheret D: Patterns of variant polyadenylation signal usage in human genes. Genome Res. 2000, 10: 1001-1010.
Kozak M: Possible role of flanking nucleotides in recognition of the AUG initiator codon by eukaryotic ribosomes. Nucleic Acids Res. 1981, 9: 5233-5252.
Grundner-Culemann E, Martin GW, Harney JW, Berry MJ: Two distinct SECIS structures capable of directing selenocysteine incorporation in eukaryotes. RNA. 1999, 5: 625-635.
Takai D, Jones PA: The CpG island searcher: a new WWW resource. In Silico Biol. 2003, 3: 235-240.
Simonsson T: G-quadruplex DNA structures – variations on a theme. Biol Chem. 2001, 382: 621-628.
Everts-van der Wind A, Larkin DM, Green CA, Elliott JS, Olmstead CA, Chiu R, Schein JE, Marra MA, Womack JE, Lewin HA: A high-resolution whole-genome cattle-human comparative map reveals details of mammalian chromosome evolution. Proc Natl Acad Sci USA. 2005, 102: 18526-18531.
National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov/]
Kim J, Bergmann A, Lucas S, Stone R, Stubbs L: Lineage-specific imprinting and evolution of the zinc-finger gene ZIM2. Genomics. 2004, 84: 47-58.
Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94.
UCSC Genome Browser. [http://genome.ucsc.edu/]
Sharp PM, Cowe E: Synonymous codon usage in Saccharomyces cerevisiae. Yeast. 1991, 7: 657-678.
Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P: The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 2003, 31: 315-318.
LeBlanc MD, Aspeslagh G, Buggia NP, Dyer BD: An annotated catalog of inverted repeats of Caenorhabditis elegans chromosomes III and X, with observations concerning odd/even biases and conserved motifs. Genome Res. 2000, 10: 1381-1392.
Keese PK, Gibbs A: Origin of genes: "Big Bang" or continuous creation?. Proc Natl Acad Sci USA. 1992, 89: 9489-9493.
Long M, Betrán E, Thornton K, Wang W: The origin of new genes: glimpses from the young and old. Nat Rev Genet. 2003, 4: 865-875.
Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM, Trask BJ: Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature. 2005, 437: 94-100.
de Beer GR: Embryos and Ancestors. 1951, Oxford: Clarendon Press, 19-20.
Raff RA: The Shape of Life: Genes, Development, and the Evolution of Animal Form. 1996, Chicago: The University of Chicago Press
McDonald JF: The molecular basis of adaptation: A critical review of relevant ideas and observations. Ann Rev Ecol Syst. 1983, 14: 77-102.
Brosius J: RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene. 1999, 238: 115-134.
Hughes AL, Green JA, Garbayo JM, Roberts RM: Adaptive diversification within a large family of recently duplicated placentally expressed genes. Proc Natl Acad Sci USA. 2000, 97: 3319-3323.
Schmid KJ, Aquadro CF: The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes. Genetics. 2001, 159: 589-598.
Gilbert W, deSouza SJ, Long M: Origin of genes. Proc Natl Acad Sci USA. 1997, 94: 7698-7703.
Olson MV: When less is more: gene loss as an engine of evolutionary change. Am J Hum Genet. 1999, 64: 18-23.
Yamauchi A, Nakashima T, Tokuriki N, Hosokawa M, Nogami H, Arioka S, Urabe I, Yomo T: Evolvability of random polypeptides through functional selection within a small library. Protein Eng. 2002, 15: 619-626.
Schmid KJ, Tautz D: A screen for fast evolving genes from Drosophila. Proc Natl Acad Sci USA. 1997, 94: 9746-9750.
Domazet-Loso T, Tautz D: An evolutionary analysis of orphan genes in Drosophila. Genome Res. 2003, 13: 2213-2219.
Grassé P-P: Evolution of Living Organisms. 1977, New York: Academic Press, 297-
Ohno S: Birth of a unique enzyme from an alternative reading frame of the preexisted, internally repetitious coding sequence. Proc Natl Acad Sci USA. 1984, 81: 2421-2425.
Zhang J, He X: Significant impact of protein dispensability on the instantaneous rate of protein evolution. Mol Biol Evol. 2005, 22: 1147-55.
Band MR, Olmstead C, Everts RE, Liu ZL, Lewin HA: A 3800 gene microarray for cattle functional genomics: comparison of gene expression in spleen, placenta and brain. Anim Biotechnol. 2002, 3: 163-172.
Huang X: An improved sequence assembly program. Genomics. 2002, 33: 21-31.
NCBI's ORF finder. [http://www.ncbi.nlm.nih.gov/gorf/gorf.html]
Frishman D, Mironov A, Mewes HW, Gelfand M: Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 1998, 26: 2941-2947. Erratum in: Nucleic Acids Res 1998, 26: 3870.
Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580.
Yuan Z, Mattick JS, Teasdale RD: SVMtm: Support vector machines to predict transmembrane segments. J Comput Chem. 2004, 25: 632-636.
Nielsen H, Krogh A: Prediction of signal peptides and signal anchors by a hidden Markov model. Proc Int Conf Intell Syst Mol Biol. 1998, 6: 122-130.
Käll L, Krogh A, Sonnhammer ELL: A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004, 338: 1027-1036.
Yeats C, Bentley S, Bateman A: New knowledge from old: In silico discovery of novel protein domains in Streptomyces coelicolor. BMC Microbiol. 2003, 3: 3-23.
Eddy SR, Mitchison G, Durbin R: Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol. 1995, 2: 9-23.
Pesole G, Gissi C, Catalano D, Grillo G, Licciulli F, Liuni S, Attimonelli M, Saccone C: MitoNuc and MitoAln: two related databases of nuclear genes coding for mitochondrial proteins. Nucleic Acids Res. 2000, 28: 163-165.
Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277.
D'Antonio L, Bagga P: Computational methods for predicting intramolecular G-quadruplexes in nucleotide sequences. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04). 2004, 561-562.
Rivas E, Eddy SR: Non-coding RNA gene detection using comparative sequence analysis. BMC Bioinformatics. 2001, 2: 8-27.
Everts RE, Band MR, Lewis ZL, Kumar CG, Liu L, Loor JL, Oliveira R, Lewin HA: A 7,872 cDNA Microarray and its use in bovine functional genomics. Vet Immunol Immunopathol. 2005, 105: 235-245.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.
Murphy WJ, Larkin DM, Everts-van der Wind A, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L: Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005, 309: 613-617.
Laboratory of Mammalian Genome Biology. [http://lewinlab.igb.uiuc.edu/Research/NovelGenes.html]
Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker – a web server for aligning two genomic DNA sequences. Genome Res. 2000, 10: 577-586.
Acknowledgements
The authors thank Dr. Robin Everts for placentome microarray data, and Dr. Denis Larkin for assistance with comparative mapping. This study and the authors' contributions to it were funded by a grant to HAL from the USDA National Research Initiative (AG2002-35205-11625). The funding body played no role in the design, collection, analysis, interpretation, writing, or the decision to submit the manuscript for publication.
Authors' contributions
CGK participated in the design of the study, wrote the computer programs for the pipeline, EST clustering, expression and genome analysis, conducted the sequence analysis and interpretation of data, and drafted the manuscript. JHL participated in the design of the study, and assisted in the collection and analysis of DNA sequence data. MRB contributed microarray data. HAL supervised the research, participated in the design of the study, interpretation of data and drafting of the manuscript. All authors have read the manuscript, provided critical reviews of content, and approved the final manuscript.