Contrasted evolutionary constraints on secreted and non-secreted proteomes of selected Actinobacteria

Background: Actinobacteria have adapted to contrasted ecological niches such as the soil, and among others to plants or animals as pathogens or symbionts. The genus Mycobacterium contains mostly pathogens that cause a variety of mammalian diseases, among which are the well-known leprosy and tuberculosis; it also has saprophytic relatives. The genus Streptomyces is mostly made up of soil microbes known for their secondary metabolites; it also contains plant pathogens, animal pathogens and symbionts. Frankia, a nitrogen-fixing actinobacterium, establishes a root symbiosis with dicotyledonous pioneer plants. Pathogens and symbionts live inside eukaryotic cells and tissues and interact with their cellular environment through secreted proteins and effectors transported through transmembrane systems; nevertheless, they also need to avoid triggering host defense reactions. A comparative genome analysis of the secretomes of symbionts and pathogens allows a thorough investigation of the selective pressures shaping their evolution. In the present study, the rates of silent and non-silent mutations in secretory proteins were assessed in different strains of Frankia, Streptomyces and Mycobacterium, of which several genomes have recently become publicly available.
Results: It was found that secreted proteins as a whole have a stronger purifying evolutionary rate (ratio of non-synonymous to synonymous substitutions, or Ka/Ks ratio) than the non-secretory proteins in most of the studied genomes. This difference becomes statistically significant in cases involving obligate symbionts and pathogens. Amongst the Frankia, secretomes of symbiotic strains were found to have undergone evolutionary trends different from those of the mainly saprophytic strains. Even within the secretory proteins, the signal peptide part has a higher Ka/Ks ratio than the mature part. Two contrasting trends were noticed amongst the Frankia genomes regarding the relation between selection strength (i.e. Ka/Ks ratio) and the codon adaptation index (CAI), a predictor of the expression rate, in all the genes belonging to the core genome as well as in the core secretory protein genes. The genomes of pathogenic Mycobacterium and Streptomyces also had reduced secretomes relative to saprophytes, as well as in general significant pairwise Ka/Ks ratios in their secretomes.
Conclusion: In the marginally free-living facultative symbionts or pathogenic organisms under consideration, secretory protein genes as a whole evolve at a faster rate than the rest, and this process may be an adaptive life-strategy to counter the host selection pressure. The higher evolutionary rate of the signal peptide part compared to the mature protein indicates that signal peptides may be under relaxed purifying selection, consistent with the signal peptides not being secreted into host cells. Codon usage analysis suggests that in actinobacterial strains under host selection pressure, such as the symbiotic Frankia strains ACN and FD and the pathogenic Mycobacterium, codon usage bias was negatively correlated to the selective pressure exerted on the secretory protein genes.


Background
Frankia is a taxon comprising nitrogen-fixing actinobacteria that establish a root symbiosis with dicotyledonous plants belonging to 8 plant families [1]. The phylogeny of these bacteria as determined by classic 16S rRNA sequence analysis groups them into 4 clusters [2]. The genus Frankia appears to have emerged from a group of soil and rhizosphere actinobacterial genera [3], many of which are extremophiles such as the thermophilic Acidothermus [2], the gamma-radiation resistant Geodermatophilus [4] or the compost-inhabiting Antarctic-dwelling Sporichthya [5]. Frankia exhibits only a few distinct morpho-physiological features, including a distinctive wall sugar [6] and unique specialized structures (termed vesicles) that are surrounded by an envelope containing oxygen-impermeable bacteriohopanetetrol, which serves to protect nitrogenase [7].
Little is known about the nature of the symbiotic determinants involved in the actinorhizal symbiosis. On the host plant side, the SymRK gene, which encodes a transmembrane kinase, has been identified and shown to control development of both actinorhizal nodulation and the mycorrhizal infection processes [8]. Furthermore, the actinorhizal host plants Alnus and Casuarina have homologs of this whole symbiotic cascade [9]. On the bacterial side of the symbiosis, the absence of a well-established, reliable genetic system has hindered attempts to identify essential genes involved in the process. However, sequence analysis of three Frankia genomes representing contrasting host specificity ranges failed to reveal the presence of genes homologous to the Rhizobium nod genes [10] or symbiotic islands [11]. These genomes were found to have undergone contrasted evolutionary pressures resulting in marked differences in their size, transposase content and loss or gain of several determinants. Relative to their saprophytic neighbors, Frankia genomes have a reduced number of secreted proteins [12], although these predictions have not been consistently confirmed experimentally [13]. This is evocative of a genome-wide strategy to keep a chemical "low profile" inside host plant cells.
Streptomyces genus is emblematic of soil microbes with its rich array of secondary metabolites that have been exploited for a long time with many powerful drugs ever since streptomycin was characterized [14]. Most Streptomyces species are described as saprophytes except for a few such as the potato pathogens S. scabiei and related species [15] and the human pathogen S. somaliensis [16]. There have also been a number of strains recently described as symbionts or commensals but the exact nature of their interaction with their host is still not clear.
The genus Mycobacterium is best known for its two terrible disease-causing species, M. tuberculosis, agent of tuberculosis [17], and M. leprae, agent of leprosy [18], as well as a few less well-known ones such as M. ulcerans. Besides these pathogens, there are a number of saprophytic species such as the pyrene-degrading soil bacterium M. vanbaalenii [19] or the commensal/environmental M. smegmatis [20].
Intracellular bacteria interact in an intimate fashion with host cells, thus facing a paradoxical challenge. Their interactions with their cellular environment through secreted proteins and effectors transported through transmembrane systems may trigger a host defense response that they would then need to fight off. Host cells have elaborate sensing systems to detect motifs that are specific for different classes of pathogens and subsequently trigger defense reactions, including synthesis of cysteine-rich defensins, oxygen radicals or toxic aromatics, which would be detrimental to symbionts [21]. Pathogenic microbes have thus evolved in close interaction with their hosts, in a gene-for-gene pattern that effectively restricts the pathogen to a subset of hosts and modulates genetic diversity as a function of host resistance [22]. For certain lineages, Frankia has been shown to have coevolved with its host plants [23], dramatically altering its transcriptome upon symbiosis onset [24], and is thus expected to have undergone pressures at the level of gene composition. One way of monitoring evolutionary pressures on genomes is to follow rates of silent and non-silent mutations [25]. For pathogens, both diversifying (positive) selection [26,27] and purifying (negative) selection [28,29] have been reported. The situation in symbionts has not been extensively studied, except for a few brief reports on Wolbachia or Rhizobium [30,31]. We undertook this investigation on genomes of three important but diverse genera of Actinobacteria to analyze the selection pressures working on them, and also looked into the evolutionary rate of secreted proteins to assess their biochemical adaptations to the environment. The genera include Frankia, a predominantly plant symbiont; Streptomyces, a group of soil-dwelling, mostly free-living actinobacteria with a few pathogens; and Mycobacterium, which contains both free-living and pathogenic species.

Results and discussion
Background of the strains chosen for the analysis
Candidatus Frankia datiscae (FD) is a non-isolated symbiont that forms effective nodules in Rosaceae, Coriariaceae and Datiscaceae [2]. Hundreds of attempts to isolate the bacteria in pure culture have failed [32] and these strains are thus considered by many to be obligate symbionts [33]. The extent of cospeciation is unknown because cross-inoculation assays have yielded conflicting results. Frankia CcI3 (CcI) can be isolated and grown in defined media [34]; however, it belongs to a homogeneous clade that is in general difficult to isolate and culture [2]. Historically, several attempts to isolate the Casuarina microsymbionts failed or else yielded atypical strains, later found to belong to cluster 3, that could not fulfill Koch's postulates [35]. On the other hand, Frankia alni ACN14a (ACN) can be isolated in pure culture and can nodulate Alnus and Myricaceae [36]. It is abundant in soils devoid of host plants and will grow well in the rhizosphere of Betula, a close relative of Alnus [37]. The two Elaeagnus isolates, EAN1pec (EAN) and EuI1c (EuI), grow well and rapidly in pure culture, are abundant in soils without host plants and have the most extensive host range, which includes Elaeagnaceae, Myricaceae, Casuarinaceae (Gymnostoma), Rhamnaceae as well as Datiscaceae and Coriariaceae, where they are present as co-inoculants. The range of substrates on which these strains grow is more extensive than that of other groups [1]. Based on the above information and the genome size, we have divided Frankia strains into 3 major groups: Group A, predominantly free-living facultative symbionts; Group B, partly free-living facultative symbionts; and Group C, marginally free-living or obligate symbionts (Table 1).
Mycobacterium species include both pathogenic and non-pathogenic ones. Pathogenic species include M. leprae TN, M. tuberculosis CDC1551 and M. ulcerans Agy99. Non-pathogenic strains include M. smegmatis MC2 155 and M. vanbaalenii PYR-1, both of them fast-growing Mycobacterium that exist as saprophytes in the environment (Table 2).
The Streptomyces species considered in the analysis are S. coelicolor A3(2), S. avermitilis MA-4680, S. griseus NBRC 13350, S. scabiei 87.22 and S. somaliensis DSM 40738. The first three are soil-dwelling saprophytes that are grown in chemostat cultures for the industrial production of various secondary metabolites, including a wide range of antibiotics. The last two are pathogens, one of a plant (S. scabiei 87.22) and the other of animals (S. somaliensis DSM 40738) (Table 3). In all three tables (1, 2 and 3), (√) denotes present, (≠) denotes absent and (−) denotes not known.

Core genome
For the five Frankia genomes examined, a core Frankia genome of 982 genes was identified. Since the Frankia EuI genome was devoid of any nif genes, the core genome did not include them. This Frankia strain will induce nodule formation on its host plant, Elaeagnus umbellata, but produces ineffective nodules that are unable to fix nitrogen [38]. Amongst the Mycobacterium genomes, 665 genes were identified as belonging to their core genomes. Since the Mycobacterium leprae genome is undergoing reductive evolution, its inclusion in the analysis may have resulted in a considerable decrease in the number of genes in the core genome for the Mycobacterium. The five genomes of Streptomyces contain 1304 genes in the core genome. Table 4 shows the Average Ka/Ks values of all of the gene orthologs belonging to the core genome.
The silent mutation rate (Ks) of all Frankia strains was found to range from 6.458 substitutions/site between ACN and CcI to 39.412 between EuI and EAN, evocative of saturation. The non-silent rate (Ka) was much lower, ranging from 0.092 substitutions/site between ACN and CcI to 0.205, or twice as much, between FD and EuI. The Ka/Ks ratio fluctuated in a narrow range of 0.029-0.047, a very low value indicative of strongly purifying selection, lower than that seen in the pol gene of the bovine immunodeficiency virus [39]. This greater than 20-fold difference in mutation rates also illustrates why protein-based phylogenies are better for reconstructing distant relationships than DNA-based ones.
The trends are also more or less similar in Mycobacterium. The silent mutation rate in Mycobacterium ranges from 2.155 between M. tuberculosis and M. leprae to 28.61 between M. tuberculosis and M. smegmatis. The non-silent rate ranges between 0.097 between M. tuberculosis and M. ulcerans and 0.196 between M. vanbaalenii and M. leprae. The silent mutation rate of Mycobacterium is thus in general much higher than that of Frankia while the non-silent rates are comparable between the two taxa. The Ka/Ks fluctuated in a range of 0.026 to 0.089, larger than in Frankia.
The silent mutation rate in Streptomyces ranges from 7.197 between S. scabiei and S. avermitilis to 28.933 between S. somaliensis and S. avermitilis. The non-silent rate ranges from 0.098 between S. scabiei and S. avermitilis to 0.156 between S. somaliensis and S. coelicolor. The Ka/Ks ratio fluctuated in a range of 0.035 to 0.057, smaller than in Mycobacterium and comparable to that in Frankia.
The core secretome (Additional file 1: Table S1) of Frankia is represented by 69-89 genes, with the nitrogen-fixing symbiotic strains having between 69 and 79 while the non-efficient cluster 4 strain EuI has 89 genes. The COG categories (besides the poorly defined R and S) that were most represented in the Frankia core secretome were M (Cell wall/membrane/envelope biogenesis), E (Amino acid transport and metabolism), O (Posttranslational modification, protein turnover, chaperones) and U (Intracellular trafficking, secretion, and vesicular transport). The categories that varied the most between the symbiotic strains and the more saprophytic ones were M and V (Defense mechanisms). Mycobacterium had a smaller core secretome of 31-40 genes, with the pathogenic M. leprae and M. tuberculosis having the smallest number of genes. The COG categories that were most abundant were M, E and C (Energy production and conversion). Streptomyces had the largest core secretome of the three genera, with 72-89 genes. The COG categories that were most represented were M, E, P (Inorganic ion transport and metabolism) and T (Signal transduction mechanisms). A correspondence analysis shows that those strains that interact closely with eukaryotic hosts have their secretomes positioned close to one another (FD and MT) and away from the more saprophytic strains (Figure 1). Curves joining the genomes as a whole to the secretomes were horizontal in the case of the FD and MT genomes while they were more vertical in the other cases.

Secretory proteins evolve faster than non-secretory proteins
The non-synonymous mutation rate (Ka) of secretory proteins was found to be higher than that of the non-secretory proteins except in one pair (Frankia CcI/EuI) where it was equal. The Ka/Ks ratio reflects the rate of adaptive evolution against the background rate. This parameter has been widely studied in the analysis of adaptive molecular evolution, and is regarded as a general method of measuring the rate of sequence evolution. To assess the intensity of mutational constraints, we considered all of the genes belonging to the core genome for all studied strains of Frankia, Mycobacterium and Streptomyces. When these core genes were studied in all possible pairwise combinations, separately for each genus, we found statistically significant differences in Ka/Ks ratios between the secretory and non-secretory protein genes (Mann-Whitney U test, significant at the P < 0.001 level) in the Frankia ACN/CcI and Frankia CcI/FD pairs. A complete list of signal peptide-bearing genes belonging to the core genome of Frankia, along with their annotation, is provided in Additional file 1: Table S1. For the other Frankia pairs, the differences were not significant. A similar analysis of the Mycobacterium genomes showed significant differences in the M. tuberculosis/M. leprae and M. tuberculosis/M. ulcerans pairings, while in the Streptomyces genomes significant differences were found in the S. coelicolor/S. scabiei and S. avermitilis/S. scabiei pairings. Interestingly, all of the Frankia and Mycobacterium genomes and some of the Streptomyces genomes that showed significant evolutionary rate differences between secretory and non-secretory protein genes were either pathogenic, marginally free-living facultative symbionts or at least partly free-living facultative symbionts (for grouping refer to Tables 1, 2 and 3). This observation prompted us to study these genomes in greater detail through pairwise Ka/Ks ratio analysis of all the orthologous genes, both core and non-core (please refer to the 'Secretory protein vs. non-secretory protein in pairwise comparison' section). The distribution of Ka/Ks values for the secretory protein genes is somewhat skewed relative to a normal (Gaussian) curve (data not shown). This skew may be associated with biochemical adaptations to the environment. There have been many instances where Ka/Ks values were found to be skewed. For instance, secreted proteins were found to be under low purifying selection in human-mouse sequence alignments [40]. On the other hand, essential genes of E. coli were shown to be under strong purifying selection [41], while, on the contrary, diversifying selection was shown in the case of plant R genes [42], CHIK envelope proteins [43] and a Shigella effector gene [44]. In some cases, such as the flu virus HA protein, both purifying and diversifying selection occur at the same time at different sites [45].
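The comparison between secretory and non-secretory Ka/Ks values relies on the Mann-Whitney U test. The following is a minimal sketch of that test, not the authors' procedure: it assumes a two-sided test, uses the normal approximation with a tie correction and no continuity correction, and the function name mann_whitney_u is hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, approximate two-sided p-value).
    Assumption: samples are large enough for the normal approximation.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value is tied, so there is no evidence of a shift
    z = (u_x - mean_u) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))

# Illustrative (made-up) Ka/Ks values, not data from this study.
secretory = [0.09, 0.11, 0.08, 0.12, 0.10]
non_secretory = [0.04, 0.05, 0.03, 0.06, 0.05]
u, p = mann_whitney_u(secretory, non_secretory)
```

With real data, x and y would hold the per-gene Ka/Ks values of the secretory and non-secretory core genes for one genome pair.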

Signal peptides evolve faster than mature regions
A secretory protein is functional only when it reaches the appropriate cellular compartment. The translocation of secretory proteins across the bacterial cytoplasmic membrane can be mediated by N-terminal signal peptides. After translocation across the membrane, signal peptides are normally cleaved from the preprotein by signal peptidases, and it has even been suggested that signal peptides may end up in the membrane, there to play a role unrelated to that of the rest of the protein [46]. Numerous analyses have indicated that there are considerable rate variations among genes and across different gene regions or subdomains [47]. This suggests that signal peptide (SP) parts might have rates of molecular evolution that are different from those of the mature peptide (MP) parts. In all possible Frankia pairs, significant differences were found in the degree of evolutionary change (i.e. Ka/Ks) between SP and MP (Mann-Whitney U test, P < 0.001) (Table 5). However, the Frankia ACN/EAN and ACN/EuI pairings showed more prominent differences between signal peptide and mature parts. Similar trends were also observed among the Mycobacterium and Streptomyces genomes. In many cases, the Ka/Ks values of signal peptides were found to be 2-7 times higher than those of the mature proteins. Similar results for an increased rate of evolution of signal peptides were reported for yeast [48] and avian growth hormone genes [49]. Although there might be a tendency, we failed to find a strong correlation between the Ka/Ks values of the mature and signal peptides. For all of our datasets, the Ka/Ks value of the signal peptide was found to strongly co-vary with the Ka/Ks value of the entire peptide. Thus, it seems that the rate of evolution of the entire peptide may be correlated with the rate of evolution of the signal peptide.

Distribution into COGs
In order to detect whether the core genome and conserved secretome had similar contents, they were distributed into functional categories (COGs) and compared with the whole genome of two representative strains for each of the three genera Frankia, Streptomyces and Mycobacterium. It thus seems that the two intracellular bacteria (MT and FD) shared a similar distribution of their secretomes into COGs. The full genomes have similar tendencies in that the pairs of genomes belonging to the three genera were close to one another, especially Streptomyces and Frankia, and were associated with category I (lipid transport) in the case of Mycobacterium, with categories P (Inorganic ion transport and metabolism) and C (Energy production and conversion) in the case of Frankia and with categories T (signaling), K (transcription) and V (defense) in the case of Streptomyces. When core secretomes were considered, the three pairs were not maintained: the three strains comprising pathogens (MT, FD and SS) were close to one another while the saprophytes were more distant (Figure 1). With regard to their secreted proteomes, Frankia FD and Mycobacterium MT were closer to one another than either was to Streptomyces (Figure 1).

Codon usage bias affecting the selection pressure
We examined whether evolutionary constraints on the genes are influenced by the codon usage bias. For Frankia ACN and Frankia FD (Figure 2), the evolutionary rate, particularly the Ka/Ks ratio, was negatively correlated to CAI values for all of the genes belonging to the core, including the secretory protein genes (Pearson correlation coefficient, R = −0.017 for core genes and R = −0.16 for secretory protein genes). Similar trends were also found with the Mycobacterium strains. One explanation for this negative correlation is that codon usage bias correlates positively with the intensity of purifying selection [50]. Therefore, genes with a stronger codon usage bias (i.e. with a high CAI value) will undergo higher negative selection pressure and thus, the evolutionary rate will be slower at non-synonymous or synonymous sites. On the other hand, Frankia strains CcI, EAN and EuI, and Streptomyces, showed a negative correlation between Ka/Ks ratio and CAI values for the core genes as a whole, while the secretory proteins exhibited a reverse trend (i.e. the Ka/Ks ratio was positively correlated to CAI, with R values ranging from 0.188 to 0.262). This kind of unusual relationship between evolutionary rate and CAI value in signal peptide-bearing genes was reported earlier for Streptomyces [48]. The authors proposed that the intensity of purifying selection was significantly relaxed in such genes.
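The correlations reported above are Pearson coefficients between per-gene Ka/Ks ratios and CAI values. As an illustration only, the coefficient can be computed as in the sketch below; the function name pearson_r and the sample values are assumptions, not material from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)

# Made-up per-gene values: higher CAI paired with lower Ka/Ks.
cai_values = [0.50, 0.60, 0.70, 0.80]
kaks_values = [0.08, 0.06, 0.05, 0.03]
r = pearson_r(cai_values, kaks_values)  # negative for this toy data
```

A negative r corresponds to the pattern described for Frankia ACN and FD, and a positive r to the reverse trend described for the secretory genes of CcI, EAN and EuI.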

Secretory protein vs. non-secretory protein in pairwise comparison
Various combinations of Frankia, Mycobacterium and Streptomyces genes were used for pairwise calculation of Ka/Ks. For this analysis, we first screened the orthologous gene pairs between genome pairs and then calculated the Ka/Ks value for every orthologous gene pair. From these data, the secretory protein genes were identified as those predicted to have a signal peptide in both members of the orthologous pair. Their Ka/Ks values were compared [...] Table S2. In Table 6, a matrix format is provided with each cell representing the difference between the average Ka/Ks value of secretory protein genes and that of non-secretory protein genes. Generally, among the marginally free-living facultative symbiont (Group C) Frankia strains (i.e. Frankia CcI and FD), the difference in evolutionary rates of secretory proteins and non-secretory proteins was quite robust. The Mann-Whitney U test showed that the difference was highly significant (P [...]).
Taken together, these results indicate an overall trend: the evolutionary constraints on secretory proteins as a whole in marginally free-living facultative symbionts or pathogenic strains were significantly increased compared to those occurring in saprophytic or free-living organisms. A possible explanation for this trend is that high Ka/Ks ratios of secretory proteins in pathogens and symbionts may reflect adaptive evolution of their sequences.

Conclusions
A definite trend emerged from our analysis of the evolutionary rates and patterns for various gene types among five Frankia, five Mycobacterium and five Streptomyces genomes. Secretory protein genes of obligate symbionts, marginally free-living facultative symbionts or pathogenic organisms evolved significantly faster than non-secretory protein genes, whereas genomes of saprophytes or predominantly free-living facultative symbionts did not exhibit significant changes in rate. This difference may be a telling genomic signature of loss of autonomy. Although robust purifying selection was encountered in most of the analyses, the secretory protein genes were found to be under stronger evolutionary selection pressure than non-secretory protein genes in symbiotic and pathogenic strains. This difference could be an adaptive strategy for them to interact better with their hosts. Further, within the secretory protein genes, the evolutionary rate (Ka/Ks) of the signal peptide was, on average, 2-7 times higher than that of the mature protein. This result suggests that signal peptides might be under relaxed purifying selection. Codon usage analysis of actinobacterial strains under host selection pressure (such as the symbiotic Frankia strains ACN and FD and the pathogenic Mycobacterium) suggests that codon usage bias had a negative impact on the selective pressure exerted on the secretory protein genes. These organisms remain in continuous cross-talk with their host, particularly through the signal peptides. It thus appears that symbiotic and pathogenic bacteria try to remain in a discreet expression mode to avoid elicitation of host defense responses, while concurrently accumulating evolutionarily neutral synonymous substitutions.
The expected arrival of a large number of genomes, in particular in the genus Frankia and its relatives, may yield more closely related genomes, from which a larger number of conserved genes can be analyzed than is possible with strains that have different host infectivity spectra, have diverged for several millions of years and share a reduced core genome. This should help identify proteins and domains subject to strong evolutionary constraints, including the determinants involved in symbiotic interactions, in particular in lineages where few or no isolates are available.

Selection of genomes used in this study
The nucleotide sequences along with their deduced amino acid sequences for all the protein coding gene sequences of five Frankia strains, namely ACN14a (NC_008278), CcI3 (NC_007777), EAN1pec (NC_009921), EuI1c (NC_014666) and symbiont of Datisca glomerata (CP002801), hereafter referred to as ACN, CcI, EAN, EuI and FD respectively, along with five Streptomyces strains: S. coelicolor A3(2)

Identification of orthologous genes
Orthologous genes were identified with the Reciprocal Best Hits (RBH) approach applied to the amino acid sequences of all protein coding genes, with an E-value threshold of 1e-10 and an identity ≥ 50% over at least 50% of the alignable region. This approach and these parameters had been used previously for screening orthologs in Streptomyces [51].
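The RBH logic can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline used in the study: hits are assumed to be already parsed from a similarity search into records, the best hit is assumed to be the one with the highest bit score, and the names Hit, best_hits and reciprocal_best_hits are hypothetical. The default thresholds mirror those given in the text.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    identity: float  # percent identity
    coverage: float  # aligned fraction of the alignable region, 0..1
    bitscore: float

def best_hits(hits, max_evalue=1e-10, min_identity=50.0, min_coverage=0.5):
    """Map each query to its best subject among hits passing the thresholds."""
    best = {}
    for h in hits:
        if h.evalue > max_evalue or h.identity < min_identity or h.coverage < min_coverage:
            continue
        # Assumption: "best" means highest bit score.
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {query: h.subject for query, h in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, **thresholds):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, **thresholds)
    reverse = best_hits(b_vs_a, **thresholds)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Toy hit tables with made-up gene names and scores.
a_vs_b = [
    Hit("a1", "b1", 1e-50, 80.0, 0.9, 300.0),
    Hit("a1", "b2", 1e-20, 60.0, 0.8, 120.0),
    Hit("a2", "b2", 1e-30, 70.0, 0.9, 200.0),
    Hit("a3", "b3", 1e-5, 90.0, 0.9, 50.0),  # fails the E-value threshold
]
b_vs_a = [
    Hit("b1", "a1", 1e-50, 80.0, 0.9, 300.0),
    Hit("b2", "a1", 1e-20, 60.0, 0.8, 120.0),
    Hit("b3", "a3", 1e-40, 90.0, 0.9, 250.0),
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy case only a1/b1 is retained: a2 points to b2, but b2's best hit is a1, and a3 is removed by the E-value filter. A core genome would then be the set of genes with a reciprocal best hit in every genome of the genus.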

Identification of secretory protein genes
Secretory protein genes belonging to the core genome were identified using the SignalP 3.0 [52] and TMHMM 2.0 [53] software. Only those genes predicted as secretory proteins by both artificial neural networks and hidden Markov models were selected. Sequences predicted to contain a signal peptide by SignalP were analyzed with TMHMM 2.0 to determine the number of transmembrane (TM) domains. Those having 0-2 transmembrane domains were retained, as done by Mastronunzio et al. [12]. Selected genes were examined individually to ensure that only genes with a viable leader peptide were kept. For the comparison of evolutionary rates of the mature part and the signal peptide part, a dataset of orthologs in which a signal peptide cleavage site was detected in both members was compiled. Mature peptides (complete sequence minus signal peptide) were analyzed by editing out the predicted signal peptide from the alignment file using a Perl script developed by us.
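The selection rule and the separation of signal peptide from mature peptide can be sketched as below. The authors' Perl script is not shown in the text, so this Python outline is only an assumption about one reasonable way to do it: the pairwise alignment is cut at the later of the two predicted cleavage columns, and all function names are hypothetical. Predictions are assumed to be already parsed from the SignalP and TMHMM outputs.

```python
def is_secretory(nn_positive, hmm_positive, tm_domains, max_tm=2):
    """Keep genes called secretory by both models with 0..max_tm TM domains."""
    return nn_positive and hmm_positive and 0 <= tm_domains <= max_tm

def cleavage_column(aligned_seq, sp_length):
    """Alignment column just past the last signal peptide residue."""
    if sp_length == 0:
        return 0
    seen = 0
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            seen += 1
            if seen == sp_length:
                return col + 1
    raise ValueError("signal peptide is longer than the sequence")

def split_alignment(aln_a, aln_b, sp_len_a, sp_len_b):
    """Split a pairwise alignment into signal peptide and mature parts.

    Assumption: the cut is placed at the later of the two cleavage columns,
    so no signal peptide residue remains in the mature part.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    cut = max(cleavage_column(aln_a, sp_len_a), cleavage_column(aln_b, sp_len_b))
    return (aln_a[:cut], aln_b[:cut]), (aln_a[cut:], aln_b[cut:])

# Toy alignment with made-up sequences and signal peptide lengths.
sp_part, mature_part = split_alignment("MKR-LAAQPT", "MK-SLA-QPS", 5, 4)
```

For a codon alignment the same cut can be applied after multiplying the column index by three.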

Evolutionary rate analysis
Orthologous gene alignments were utilized for evolutionary rate analyses. The numbers of non-synonymous and synonymous substitutions per site (Ka and Ks, respectively) and their ratio (Ka/Ks) were estimated with Codeml in the PAML software package [54]. A BioPerl script was used with the PAML program to estimate the pairwise Ka and Ks values. The script first translated cDNAs into proteins and aligned the protein sequences. The protein alignments were then projected back into cDNA coordinates and used by the PAML module to calculate the Ka/Ks ratio using the maximum likelihood method. To study the evolutionary rate of the signal peptide part and the mature part of a protein, the Ka/Ks value of each component of the protein was determined separately.
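The step of projecting a protein alignment back into cDNA coordinates can be illustrated with the short sketch below. It is not the BioPerl script used in the study; the function name back_translate_alignment is hypothetical, and the handling of a trailing stop codon is an assumption. The Ka/Ks estimation itself is left to Codeml and is not reimplemented here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def back_translate_alignment(aligned_protein, cds):
    """Thread an ungapped coding sequence onto a gapped protein alignment.

    Each residue is replaced by its codon and each gap by three gap
    characters, so codon boundaries are preserved in the result.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for ch in aligned_protein if ch != "-")
    # Assumption: a trailing stop codon is absent from the protein alignment.
    if len(codons) == residues + 1 and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    if len(codons) != residues:
        raise ValueError("protein and coding sequence lengths do not match")
    remaining = iter(codons)
    return "".join("---" if ch == "-" else next(remaining) for ch in aligned_protein)

# Toy example with made-up sequences.
codon_alignment = back_translate_alignment("M-KL", "ATGAAACTGTAA")
```

Applying the function to both members of an orthologous pair yields two codon-aligned sequences of equal length, which is the kind of input a pairwise Ka/Ks estimator expects.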

Codon bias analysis
The codon adaptation index (CAI) is a measure of directional synonymous codon usage bias [55]. The index uses a reference set of highly expressed genes from a species to assess the relative usage of each codon, and the score of each gene is calculated from the frequency of use of all codons in that gene. The index assesses the extent to which selection has been effective in molding the pattern of codon usage. The CAI value for each gene belonging to the core genome was calculated with the help of CAI Calculator 2 (http://userpages.umbc.edu/~wug1/codon/cai/cais.php) [56].
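To make the definition concrete, the sketch below computes CAI in its standard form: each codon receives a relative adaptiveness equal to its count in the reference set divided by the count of the most used synonymous codon, and the CAI of a gene is the geometric mean of these values over its codons. This is not the code behind CAI Calculator 2. The function names, the exclusion of Met, Trp and stop codons, and the replacement of zero counts by 0.5 are assumptions following common practice.

```python
import math
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codons_of(seq):
    """Split a coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def relative_adaptiveness(reference_genes):
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for g in reference_genes for c in codons_of(g) if c in CODON_TABLE)
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa in "*MW":  # stops and single-codon amino acids carry no information
            continue
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        # Assumption: unseen codons get a pseudo-count of 0.5 to avoid log(0).
        top = max(max(counts[c], 0.5) for c in family)
        weights[codon] = max(counts[codon], 0.5) / top
    return weights

def cai(gene, weights):
    """Geometric mean of codon weights over the informative codons of a gene."""
    logs = [math.log(weights[c]) for c in codons_of(gene) if c in weights]
    if not logs:
        raise ValueError("gene has no informative codons")
    return math.exp(sum(logs) / len(logs))

# Toy reference set (made up): GCT is used three times, GCC once.
weights = relative_adaptiveness(["GCTGCTGCTGCC"])
score_preferred = cai("GCTGCT", weights)
score_mixed = cai("GCTGCC", weights)
```

In this toy case a gene using only the preferred alanine codon scores higher than one mixing in the rarer codon, which is the sense in which CAI serves as a predictor of expression level in the analysis above.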