Expression sequence tag library derived from peripheral blood mononuclear cells of the chlorocebus sabaeus

Background African Green Monkeys (AGM) are amongst the most frequently used nonhuman primate models in clinical and biomedical research, nevertheless only few genomic resources exist for this species. Such information would be essential for the development of dedicated new generation technologies in fundamental and pre-clinical research using this model, and would deliver new insights into primate evolution. Results We have exhaustively sequenced an Expression Sequence Tag (EST) library made from a pool of Peripheral Blood Mononuclear Cells from sixteen Chlorocebus sabaeus monkeys. Twelve of them were infected with the Simian Immunodeficiency Virus. The mononuclear cells were or not stimulated in vitro with Concanavalin A, with lipopolysacharrides, or through mixed lymphocyte reaction in order to generate a representative and broad library of expressed sequences in immune cells. We report here 37,787 sequences, which were assembled into 14,410 contigs representing an estimated 12% of the C. sabaeus transcriptome. Using data from primate genome databases, 9,029 assembled sequences from C. sabaeus could be annotated. Sequences have been systematically aligned with ten cDNA references of primate species including Homo sapiens, Pan troglodytes, and Macaca mulatta to identify ortholog transcripts. For 506 transcripts, sequences were quasi-complete. In addition, 6,576 transcript fragments are potentially specific to the C. sabaeus or corresponding to not yet described primate genes. Conclusions The EST library we provide here will prove useful in gene annotation efforts for future sequencing of the African Green Monkey genomes. Furthermore, this library, which particularly well represents immunological and hematological gene expression, will be an important resource for the comparative analysis of gene expression in clinically relevant nonhuman primate and human research.


Background
Nonhuman primates (NHP) are used in many areas of biomedical research because of their close relationship to humans. Indeed, for some human diseases, such as for HCV and HIV infections, they still represent the only available animal model. Moreover, optimal drug safety assessment and vaccine development are in many instances dependent on NHPs. Nowadays, the knowledge of their genome and transcriptome becomes critical for an efficient and parsimonious use of these models. The genome of the Chimpanzee (Pan troglodytes) [ 1 ] , Indian rhesus macaque (Indian Macaca mulatta) [ 2 ] , Orangutan (Pongo abelii [3], Chinese rhesus macaque (Chinese Macaca mulatta [4] and Cynomolgus macaque (Macaca fascicularis [4,5] have been sequenced, and sequencing of several other NHP genomes is ongoing [6,7]. The African Green Monkey (AGM) is a widely used species in biomedical research for studies in the field of immunology, neuroscience (such as Parkinson's disease [8,9], cardiovascular disease [10], cell biology [11][12][13], pharmacology [14] and infectious diseases [15][16][17][18][19]. AGMs are one of the 40 natural hosts of the Simian Immunodeficiency Virus (SIV). They are particularly interesting http://www.biomedcentral.com/1471-2164/13/279 models for studying of AIDS as this species is protected against the disease. Despite chronic infection by SIV, they generally do not develop any clinical symptoms [19,20] and hence are used to identify correlates of protection [19][20][21]. AGMs are divided into four species, named vervet (Chlorocebus pygerythrus), grivet (Chlorocebus aethiops), sabaeus (Chlorocebus sabaeus) and tantalus ( Chlorocebus tantalus). Among them, the vervet and sabaeus species have been most extensively studied [22][23][24][25][26][27]. Three hundred years ago, AGMs that belonged to the C. sabaeus species were transferred during slave trade from West/Central Africa to the Caribbean islands [28]. The only large breeding centers for AGMs are now located in these Islands, and the C. sabaeus species is now becoming the most studied AGM model for SIV and in biomedical research in general. In the context of viral infections such as with SIV, one of the main issues for the development of treatments and vaccines against human diseases, is to better understand the host transcriptomic responses of immune cells as the host immune response is mainly responsible for the outcome of the infection. Moreover, due to the important amount of genes expressed in case of activation, immune cells are relevant for revealing significant parts of the host transcriptome. So far, research involving AGMs, especially using gene expression profiling, were limited by the lack of sufficient gene sequence information and most studies were dependent on tools developed for human and more recently macaque species [7,29,30]. This limitation is a major problem since sequenced genes from AGMs revealed significant nucleotide differences from the human and even the macaque genomes [31][32][33], and more information on AGM gene sequences are therefore urgently needed. It should be noted however, that the difference between NHP and humans is higher at the level of which gene is expressed, rather than at the nucleotide diversity level [34]. In addition, it has been shown that NHP cells express additional genes that are not expressed in humans [35], and we have shown in a previous study that C. sabaeus express up to 16,000 genes in peripheral CD4+ cells, with 990 being specific of the species [36]. Annotating such sequences is challenging given that limited information is available and only few hundreds sequences of C. sabaeus are currently present in the GeneBank [37] database (Additional file 1: Figure S1). In this study, we constructed, sequenced and annotated a C. sabaeus EST (Expression Sequence Tag) library obtained from Peripheral Blood Mononuclear Cells (PBMC), as a tool for annotating AGM reference genomes, in order to allow the generation of technologies dedicated to analyze the immune responses in this species, as well as providing immediate valuable information to better understand the molecular and cellular mechanisms involved in AIDS resistance.

Composition and assembly of the Chlorocebus sabaeus PBMC EST library
Our aim was to obtain the sequence information for the genes expressed in C. sabaeus immune cells. In order to be representative, we collected fresh PBMC from twelve SIV-infected and four non infected animals. In order to identify as many distinct transcripts as possible, we in vitro stimulated these cells or not with Concanavalin A (ConA), lipopolysaccharides (LPS) and by mixed lymphocyte reactions (MLR), as these stimuli upregulate mRNA expression of many genes. The different stimuli were chosen to activate distinct cellular receptors (T cell receptor, Toll-like receptors) and stimulate distinct immune cells (lymphocytes and antigen-presenting cells). Total RNA preparations from the stimulated and unstimulated cells were pooled and a cDNA library constructed. Sequences were obtained and sequence quality filtering showed that 37,787 ESTs were present in the library. They had a mean length of 563 nucleotides per EST with a standard deviation of 167 nucleotides ( Figure 1A). The 37,787 ESTs have been assembled into 3,853 contigs (overlapping or embedded ESTs) and 10,557 singletons (not assembled ESTs). The median number of ESTs per contig was 3 with some outlier contigs being composed of up to 941 ESTs ( Figure 1B). The mean length of the 14,410 assembled and singletons ESTs averages at 943 nucleotides ( Figure 1C). The total length represented by our AGM EST library is about 21.10 6 nucleotides and the total length of the assembled distinct transcripts 9.10 6 nucleotides. Since the total length of the known M. mulatta distinct transcripts corresponds to 72.10 6 nucleotides in the Ensembl database [38], our AGM sequences represent 12% of the M. mulatta transcriptome and potentialy a similar fraction of the AGM transcriptome.

Inter-and intra-species comparisons
We then compared the ESTs to available transcriptomes of other primate species for annotation purposes and for quantification of transcript homologs. In order to get a general as well as a specific view, we used both the total 37,787 ESTs of the original library and the assembled distinct transcript library. They were aligned to available cDNA datasets of the following ten pri-  least one species. 29,191 of the total ESTs and 7,985 of the assembled ESTs have been aligned on at least two cDNA references. 1,628 of the total ESTs and 135 of the assembled ESTs have been aligned on all the 10 species, while 6,576 of the total ESTs and 5,384 of the assembled ESTs could not be mapped to any cDNA reference sets and are then potentially specific to the C. sabaeus transcriptome or highly diverse orthologs ( Figure 1D). ESTs of the total and assembled AGM libraries have also been aligned to the draft assembly of the M. fascicularis genome [5] and a sequencing read library of the C. sabaeus genome [39]. Alignment results have been filtered to only keep for each EST the 5 best mapped reads when possible of the C. sabaeus draft scaffold genome, and the best mapped genomic position on the M. fascicularis draft assembly genome. Table 2 provides a summary of the results of the alignments with the 10 cDNA references and the 2 draft genomes. The highest number of aligned ESTs for both the original and the assembled ESTs was found for the H. sapiens (∼80% of the original ESTs and ∼60% of the distinct transcripts) probably due to the relatively higher degree of investigation of this genome. The higher frequence as to compared to the ones of NHP is thus due to the broader sequence information from human genomes and does not reflect the biological dis-   and thus is not directly comparable to the alignments of the EST libraries. Overall, while giving different specific alignment information, the number of mapped transcripts and mapped genes for both the 37,787 originals ESTs and the 14,410 distinct transcripts are convergent in the number of mapped genes and proportional with the genomic distances that exist among these species.

Specific comparison with the Macaca mulatta transcriptome
The M. mulatta species is the closest primate species to the C. sabaeus for which significant genomic information is available. In order to gain additional information about the transcript fragments that we provide, we annotated them with the particular section positions of the messenger RNAs available for the M. mulatta species.
We specified for each EST of the assembled library the positions of the 5'-untranslated region (5'UTR), coding DNA sequence (CDS), and 3'-untranslated region (3'UTR) based on the M. mulatta cDNA reference annotations. Among the 14,410 assembled ESTs, 11,211 could be annotated: 6,244 ESTs with the 5'UTR, 9,657 ESTs with the CDS, and 5,313 ESTs with the 3'UTR.
We report 506 M. mulatta transcripts that have been mapped to more than 90% by an EST (Additional file 2: Table S1). CXCL10 ( Figure 3) and S100A4 (Additional file 3: Figure S2) are part of these transcripts and given as examples.

Quantification of expressed sequences and functional pathway analysis of the EST library
In order to have a quantitative view of the expressed sequences of the C. sabaeus PBMC, we identified the most expressed transcripts in our EST library based on the M. mulatta homolog transcripts. Based on the 44,725 transcripts of the M. mulatta cDNA reference and the 14,410 ESTs of the original ESTs library, we calculated for each transcript the number of sequences mapped and obtained a list of the 50 most expressed M. mulatta ortholog transcripts in our EST library (Table 3). Among these most expressed transcripts, the hemoglobin beta (HBB) and alpha (HBA) genes were present, which might reflect a red blood cells contaminations of the PBMC, as well as more specific immune-related genes, such as CD74 and Granzyme B (GZMB). Some EST which correspond to genes which play an important role in immune responses against pathogens have been aligned: IRF7 ( Figure 4), CD4 (Additional file 4: Figure S3), IFNG (Additional file 5: Figure S4), IFNGR1 (Additional file 6: Figure S5), IFNGR2 (Additional file 7: Figure S6). For all these transcripts, EST alignment positions as well as protein domains are given. Furthermore, in order to identify the over-represented pathways in our AGM EST library, we performed a functional canonical pathway analysis based on the list of the 9,208 H. sapiens transcripts uniquely mapped by the 37,787 original ESTs. Most of the canonical pathways found as statistically significantly over-represented are related to B and T cell signaling, and immune response pathways (Table 4). For instance, the "CD28 signaling in T Helper Cells", "iCOS-iCOSL signaling in T Helper Cells", "B Cell receptor Signaling" (Additional file 8: Figure S7A), and "T Cell receptor signaling" (Additional file 8: Figure  S7B) pathways belong to the list of pathways found as significantly over-represented in our AGM library, as well as the "Glucocorticoid receptor signaling", "Role of NFAT in regulation of the immune response" (Additional file 9: Figure S8A), "Antigen presentation pathway" (Additional file 9: Figure S8B), "JAK/STAT signaling", and many different "Interleukin signaling" pathways. As a result of the in vitro stimulation of SIV-infected PBMC, the "NF-κB Activation by viruses" (Additional file 10: Figure S9A) and "Induction of apoptosis by HIV-1" (Additional file 10: Figure S9B) pathways are also significantly overrepresented. Consistent with the stimulation by LPS, the "Interferon Signaling" ( Figure 5A) and "Toll-like Receptor Signaling" ( Figure 5B) pathways are also found significantly over-represented. Finally, ConA is capable of triggering positive selection in mature T cells by crosslinking the TCR with high avidity [40,41] and we found 8 pathways corresponding to these functions being induced ( Table 4). The over-representation of gene transcripts belonging to these pathways of the immune system further indicates that this library is a valuable resource for profiling global gene expression in AGM immune cells. Overall, these gene and pathway information are consistent with what we could expect from an EST PBMC library.

Analysis of the inter-species genomic relationships
Analysis of genomic relationships among species is an important way for studying evolution of genomic features. The relationships of the C. sabaeus EST with the 10 above described primate species for which the cDNA references were available have been quantified.  (Table 5) and a phylogenetic tree constructed ( Figure 6A). As it would be expected, the H. sapiens and P. troglodytes are clustered together, as it is also the case for the C. sabaeus and M. mulatta.T h eG. gorilla, P. ab e li i ,a n dN. leucogeny were located between these two clusters, and the C. jacchus, M. murinus, O. garnettii,a n dT. syricht a are segregated from other species. By comparing only to the more related species, phylogenetic trees have also been computed with a higher number of AGM ESTs with all the 14,410 assembled transcripts ( Figure 6B). Finally, trees have been http://www.biomedcentral.com/1471-2164/13/279 constructed for specific sections of the transcripts: 5'UTR regions ( Figure 6C), CDS sections ( Figure 6D), and 3'UTR regions ( Figure 6E). Both the phylogenetic trees restricted on the CDS and 3'UTR sections show a clusterisation of the C. sabaeus with the M. mulatta and a strong segregation with other species. Interestingly, the phylogenetic tree restricted on the 5'UTR sections revealed a different shape. C. sabaeus and the N. leucogeny species clustered together, suggesting distinct selective pressures in the 5'UTR as compared to other regions.

Discussion
AGMs have provided useful animal models in biomedical research for many years [17,[42][43][44][45][46][47][48][49][50]. They are also becoming a more and more essential model to the study of human biology and disease, such as neurological disorders [51,52] and AIDS [19,53]. Several studies could not be conducted so far because of the insufficiency of genomic resources on this primate [7]. This is a major limitation in view of the information that new generation technologies can offer for the progress in development of strategies to http://www.biomedcentral.com/1471-2164/13/279  prevent or treat human diseases. The growing interest for this model is shown through the increase of the number of sequences published in the NCBI nucleotide database for this species every year (Additional file 1: Figure S1) and the sequencing of its genome which is underway. Nevertheless, our recent survey (as of January 11, 2012) showed that while there were 11,413,043 and 225,854 nucleotide sequences available for H. sapiens and M. mulatta,respectively, there were still only 2,527 nucleotide sequences for AGM in the databases. The primary goal of this study was to enhance the development of an AGM genomic resource through the construction, sequencing and characterization of a PBMC cDNA library of AGM (C. sabaeus).
The results could be used to expand genomic research activities on this species. We focused here on the construction of a cDNA library on blood immune cells (PBMC) in order to get as much immune defense genes as possible which could help to the study of several disease mechanisms such as the understanding of AIDS resistance in AGM. Therefore, to increase the expression of such genes, the cells were challenged or not with immune-relevant stimuli (ConA, LPS, MLR). We also chose to work on both, SIV-infected and non-infected animals, in order to eventually reveal new genes that might have a unique role in AIDS resistance in this natural host. The sequencing of the cDNAs yielded 37,787 ESTs with 14,410 assembled and singletons ESTs which cover 12% of the transcriptome. For annotation purpose, we aligned the 14,410 cDNA sequences of our library to the known cDNA libraries of 10 other primate species including the human one. Of the 31,005 ESTs identified, as many as 6,576 ESTs did not match any gene reported in the database. This high number of novel sequences might be due to the fact that the genomes of the other NHP species are not sufficiently annotated yet. H o w e v e r ,af e wo ft h e mm i g h tb et r u en e wg e n ec a n d idates. Indeed, the stimulation used might have revealed a number of silent genes only expressed under the condition of infection.
As one would expect, at the CDS level, the divergence between C. sabaeus and M. mulatta was lower than between C. sabaeus and the other primate species. The gene distance at the non-coding regions was higher than in the CDS and higher for 5'UTR than 3'UTR. Interestingly, the 5'UTR of C. sabaeus did not cluster any more with M. mulatta, at least not consistently. This is in line with the fact that on average, 5' and 3' UTRs are less conserved across species than protein-coding sequences, with the 5'UTR being the most divergent, but still more conserved than untranscribed sequences [54,55]. It has been shown that high differences in the 5'UTR of orthologous genes correlate with their expression levels [56]. Indeed, this region is rich in regulatory elements. Changes in the regulation of gene expression levels play an important role in phenotypic diversity among closely related organisms [57,58]. The high distances observed at the 5'UTR region between the different primate species studied here might reflect part of these changes (Additional file 11: Table S2). However, we can not exclude that on one hand, our analyses might have misestimated the gene distance of UTR or ESTs in general, between the C. sabaeus and other species, as the length of the ESTs of our library are shorter (943 nucleotides) than the average length of human and macaque cDNAs (1,500 nucleotides) in the databases. On the other hand we might have overestimated this distance as compared to the rest of the transcriptome because the genes included in this library are mostly immune-related, thus among the most known divergent genes [59].
To further analyze the library, we determined the biological pathways represented by the 14,410 annotated ESTs. Among more general pathways (protein ubiquitination pathway, mitochondrial pathway), many pathways were related to the immune system (T cell activation,    List of the top 50 canonical pathways found as statistically significantly over-represented in the functional pathway analysis of the EST library. For each canonical pathway, the associated multiple testing corrected p-value (shown as −log(q-value)) is indicated as well as the ratio between the number gen a of genes of the pathway mapped by the EST library and the total number gen b of genes defining the pathway.
using cell lines derived from AGM such as COS-7 and Vero cells. We studied in more detail genes which are of major importance for host immune defenses, such as IFN-γ , IFNGR, CXCL10 and IRF7 [60][61][62]. For the IFN-γ receptor (IFNGR), it has been shown in humans that any variation having a significant impact on IFNGR function is not tolerated [63]. Therefore, the deletion observed in the cytoplasmic tail of IFNGR1 in AGM as compared to macaque might either not have any functional consequence on this pathway or give to this species a yet unknown evolutionary advantage. Thus, it would be interesting to compare the sequence of AGM IFNGR1 with other SIV natural hosts in order to evaluate if this might play a role in AIDS resistance. CXCL10 (or IP-10) is a chemokine involved in the recruitment of cells of the immune system to sites of inflammation and is induced by IFN-α and IFN-γ [60]. Alteration of IP-10 expression has been associated with inflammatory diseases including infectious diseases, immune dysfunction and tumor development [64,65]. We did not find any difference at the amino acid level between the CXCL10 from C. sabaeus and the one from M. mulatta.T h i sc o n s e r v a t i o ns u ggests that any variation having an impact on CXCL10 function could be deleterious. IRF7 encodes a transcription factor which plays a role in the activation of virusinducible cellular genes, including the type I interferon genes. The partial sequences from our library did not show the same mutations that were suggested to play a role in AIDS-resistance in another SIV-natural host, the sooty mangabey [66]. The mutations in IRF7 reported in one SM [66], were however also either not confirmed when studied in a large number of SM animals or found to be non-fixed and with no effects on the phenotype even w h e np r e s e n ti nh o m o z y g o s i t y( J o h n s o nZ ,S i l v e s t r iG , and Bosinger SE, personal communication). However, as this is not the same species, the mutations could be at other sites, or the mechanisms of AIDS resistance might be different between AGM and sooty mangabey. As our library was constructed on a pool of cells from 16 different animals, the sequences obtained are not representative of the inter-individual variability and need to be verified on the individual level for further studies.
To our knowledge, this is the first time that abundant genetic information on AGM is given. In this study, a total of 37,787 ESTs were sequenced, from which 14,410 contigs and singletons were identified, covering 12% of the AGM transcriptome. Moreover, this cDNA library provides both a large collection of novel transcripts and a detailed annotation of immune genes. The high volume of apparently novel AGM sequences suggests that our data could be a useful resource for future genomic investigation.

Construction and sequencing of the EST library
Twelve SIV-infected and four non-infected C. sabaeus (from Caribbean islands) were used in this study. The Central Committee for Animals at Institut Pasteur, Paris, France, reviewed and approved the use and care of animals. The experiments were performed according to national and European guidelines. Whole blood was collected from monkeys under anesthesia in heparinized tubes. PBMC were isolated from whole blood by density gradient centrifugation using the Lymphocyte Separation Medium 1077 (PAA Laboratories GmbH) and activated or not with different stimuli in RPMI-1640 with 10% fetal calf serum. For ConA activation (from Canavalia ensiformis (Sigma-Aldrich, St. Louis, MO, USA)): 4.10 6 of isolated PBMC were plated with 10µg.ml −1 ofCon Af or2,6,24, 36 or 72h. For LPS (E.Coli 0111:B4 Sigma (L2630)) activation: 4.10 6 of isolated PBMC were plated with 10µg.ml −1 of LPS for 2, 6, 24, 36 or 72 hours. The MLR were done by mixing 4.10 6 of isolated PBMC with 4.10 5 PBMC from another animal for 2, 6, 24, 36 or 72h. Unstimulated cells were also kept for further RNA extraction. Total RNA was extracted from harvested cells by using the RNeasy® Mini Kit (Qiagen, Courtaboeuf, France) following the manufacturer's instructions. Briefly, cells were lysed in 350µl of RLT buffer, run over a QiaShredder column (Qiagen) to ensure homogeneous lysis, and resuspended in 30µl http://www.biomedcentral.com/1471-2164/13/279   of sterile water. We added a DNase-RNase free (Qiagen) treatment on the column to eliminate any potential DNA contamination of RNA preparations. The quality and concentration of RNA was assessed as before [36]. The libraries were plated, arrayed robotically and bacterial clones have their plasmid DNA amplified using phi29 polymerase. The plasmids were end-sequenced by the Genoscope using BigDye Termination kits on Applied Biosystems 3730xl DNA Analysers.

EST quality filtering
Poly-A and poly-T tails have been trimmed from the sequenced ESTs by using the trimest tool [67] (default parameters have been used) while starting and ending terminal N's have been trimmed from the sequences using the trimseq tool [67] (a threshold cutoff parameter of 20% of Ns in a window of 30 nucleotides has been used).

Assembly of the EST library
Assembly of ESTs into contigs has been performed using the EGassembler [68] tool. EGassembler aligns and merges sequence fragments resulting from shotgun sequencing or gene transcripts fragments in order to reconstruct the original segment or gene (an overlap identity cutoff parameter of 80% has been used).

cDNA references and genomes used in this study
The C. jacchus, G. gorilla, H. sapiens, M. mulatta, M. murinus, N. leucogeny, O. garnettii, P. troglodytes, P. ab e li i ,and T. syricht a cDNA references have been retrieved from the Ensembl [38] database. The sequencing of the C. sabaeus genome is currently in progress as part of an international collaborativeeffortattheWashingtonUniversityGenome Center [39] and the draft scaffold genome release of this project has been used in this study. The draft assembly of the M. fascicularis genome used in this study is available through the ENA [69] database via accession numbers from FR874244 to FR874264 [5].

ESTs alignment procedures
Alignment of the ESTs on the cDNA references and on the M. fascicularis draft assembly genome has been done using the BLAST tool [70] (an Expect value cutoff parameter of 10 has been used). Alignment results have been filtered to only keep for each EST the best alignment for each species that has at least a support of 80% with the EST sequence. Alignment of the ESTs on the C. sabaeus draft scaffold genome has been performed using the CBRC-LAST [71] based online tool available on the website of the Washington University Genome Center [72].

Functional pathway analysis
The functional pathway analysis of the EST library has been performed using Ingenuity Pathways Analysis (IPA, Ingenuity® Systems). IPA examines expressed genes in the context of known biological functions and pathways, mapping each gene identifier in a dataset to its corresponding molecule in the Ingenuity Pathways Knowledge Base (IPKB). P-values attributed to each pathway representing the statistical over-representation significance have been calculated by using the right-tailed Fisher's exact test and have been adjusted using the Benjamini-Hochberg Multiple Testing correction [73]. Over the 9,208 H. sapiens http://www.biomedcentral.com/1471-2164/13/279 transcripts uniquely mapped by the 37,787 original ESTs, 8,579 have been identified by the IPKB and then used in the functional analysis.

Quantification of the evolutionary relationships and construction of the phylogenetic trees
Quantification of the evolutionary relationships among ESTs and EST mapped sequences has been performed using the Needleman-Wunsch multiple alignment algorithm [74]. Distance among sequences has been calculated using the Jukes-Cantor method [75] (maximum likelihood estimate) based on the NUC44 scoring matrix. Phylogenetic trees have been constructed by using the Unweighted Pair Group Method Average linking method (UPGMA, group average [76].