Buffering by gene duplicates: an analysis of molecular correlates and evolutionary conservation

Background One mechanism to account for robustness against gene knockouts or knockdowns is through buffering by gene duplicates, but the extent and general correlates of this process in organisms is still a matter of debate. To reveal general trends of this process, we provide a comprehensive comparison of gene essentiality, duplication and buffering by duplicates across seven bacteria (Mycoplasma genitalium, Bacillus subtilis, Helicobacter pylori, Haemophilus influenzae, Mycobacterium tuberculosis, Pseudomonas aeruginosa, Escherichia coli), and four eukaryotes (Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Mus musculus (mouse)). Results In nine of the eleven organisms, duplicates significantly increase chances of survival upon gene deletion (P-value ≤ 0.05), but only by up to 13%. Given that duplicates make up to 80% of eukaryotic genomes, the small contribution is surprising and points to dominant roles of other buffering processes, such as alternative metabolic pathways. The buffering capacity of duplicates appears to be independent of the degree of gene essentiality and tends to be higher for genes with high expression levels. For example, buffering capacity increases to 23% amongst highly expressed genes in E. coli. Sequence similarity and the number of duplicates per gene are weak predictors of the duplicate's buffering capacity. In a case study we show that buffering gene duplicates in yeast and worm are somewhat more similar in their functions than non-buffering duplicates and have increased transcriptional and translational activity. Conclusion In sum, the extent of gene essentiality and buffering by duplicates is not conserved across organisms and does not correlate with the organisms' apparent complexity. This heterogeneity goes beyond what would be expected from differences in experimental approaches alone. Buffering by duplicates contributes to robustness in several organisms, but to a small extent – and the relatively large amount of buffering by duplicates observed in yeast and worm may be largely specific to these organisms. Thus, the only common factor of buffering by duplicates between different organisms may be the by-product of duplicate retention due to demands of high dosage.


Estimation of gene duplicates (paralogs)
To validate our prediction of paralogs (gene duplicates), we examined gene family size distributions of the resulting groups of gene duplicates, and we tested several alternative approaches and compared our results to previous studies by Tong et al. [1,2] and to the gene families obtained using a single E-value cutoff (10 -10 ).

Figure S1. Gene family size distribution for all genomes in our analysis
Gene families are defined by BLAST E-value<10 -10 (3x10 -10 for worm, fly, 1x10 -9 for Mycoplasma). (A) absolute numbers; (B) fraction of genome. The distributions are as expected: small gene families are more frequent than large families. The fraction of gene families (compared to singletons) increases in eukaryotes compared to prokaryotes and in multi-cellular compared to uni-cellular organisms. There is no obvious enrichment of two-gene families (D=1) in yeast. These results confirm the validity of our method of paralog estimation.

Paralog estimation based on E-value cutoffs
First, we tested different E-value thresholds for homology estimation in yeast. For all three different Evalues (10 -10 , 10 -20 , 10 -30 ), we observe a significant enrichment in homologous genes amongst double knockout mutants with an SSL phenotype. When applying the same E-value cutoff used by Tong et al. [1,2] of 10 -8 , we obtained ~2% of homologous genes. This result is identical with what Tong et al. report, confirming our estimate. In the paper, we use a threshold of 10 -10 , as this proved to be the best compromise between conservative homolog estimation (low false-positive rate) but sufficiently many gene families for statistically meaningful analysis.
In addition to use of an E-value threshold, we tested several other constraints on paralog estimation: a) bidirectionality of the hit; b) length of the match region between two alignments; and c) single-linkage clustering. The results of these tests were examined manually with respect to the gene family size distribution, and sequence alignments of selected gene families. a) We tested whether gene family sizes would change when requiring both genes (query gene and its match) to have their match's E-values below the threshold, instead of just requiring the query gene to have an E-value<threshold to its match. This method represents a stricter paralog definition, decreasing gene family sizes and increasing the number singletons. As this additional requirement did not introduce any obvious advantage we decided against its use. b) Sequences can match across their whole length or along only part of their length. This behavior is expressed by the 'alignment match length' which denotes the fraction of the shorter sequence that is aligned to the other sequence. We tested match length requirements of 0.6, 0.7 and 0.8, but did not observe significant changes to the gene family size distributions. The E-value itself is partially a function of match length, thus a relatively stringent cutoff (such as 10 -10 ) indirectly requires substantial match length. For our analysis of buffering capacity of homologous genes we did not require the genes to match over their entire length, as even parts of the gene (protein) could buffer for the function of the other. c) Gene-families can be reconstructed by grouping genes with common paralogs into one family. As our method includes local matches (matches along only part of the sequence, see b), a single-linkage clustering algorithm bears the danger of combining genes, via common homologs, that have no sequence similarity at all and are not paralogs. In addition, for our analysis we were interested in the number of paralogs per gene (effective gene family size D) rather than the actual gene family sizes, thus gene family clustering again did not seem a feasible step to do. In sum, all three variations did not visibly improve our paralog estimation, and in the spirit of parsimony we decided for simple application of one E-value cutoff. Note that, as described in the paper, this E-value cutoff has been adjusted in genomes much smaller or larger than yeast.

Paralog estimation based methods other than E-value cutoffs
We tested additional methods of paralog estimation independent of the absolute E-value between two sequences. a) In the 'drop' method, we examined for each gene the difference in -log(E-value) to its rank-ordered hits, with the minimal -log[E-value]>3. For each gene, we counted a hit as homolog if the difference between its -log[E-value] and the -log[E-value] of the next better hit was smaller than 2, 3 or 5. In other words, we counted all hits as homologs of a particular gene if their -log[E-value]s were similar to each other and significantly better than the -log[E-value]s of all other hits. We produced gene families from these groups of homologs using single linkage clustering. The 'drop' method resulted in a similar gene family size distribution as the E-value 'cutoff' method. b) In the 'bidirectional best hit' method, gene pairs qualified as paralogs if they had each other as their best hit, independent of the actual E-value, but requiring a minimum -log[E-value]>5. The disadvantage of this method is that it only produces pairs of paralogs, but not larger gene families. c) For yeast we also estimated paralogs using the Inparanoid database [3]. In this approach, each yeast gene was defined as a paralog of a query gene if its BLAST score is better than the BLAST score to an ortholog of the query gene. Estimating yeast paralogs with Arabidopsis, C.elegans, and human as reference genome did not result in sufficiently high numbers of paralogs for meaningful statistical analysis. A simple E-value cutoff proved to be as good as or better than the methods discussed above, hence we decided to use it as our primary method of paralog estimation. Figure S2. Survival rates of single-gene knockouts as function of effective gene family size and sequence distance (E-value) to the closest paralog.

The number of duplicates per gene and the distance to the nearest duplicate
A. The effective gene family size is defined by BLAST E-value cutoff 10 -10 as described in the main text (Methods). The effective gene family size D denotes the number of paralogs available for a given gene. Except for E.coli and worm, there are no or only weak correlations between D and the probability of survival P(S) of single-gene KOs.
B. The sequence distance is estimated as the E-value between the bait gene (targeted for single-KO) and its closest paralog. The histogram shows -log 10 (E-value) bins, with bin "0-5" denoting the least closely, bin "30+" denoting the most closely related genes. For some organisms, i.e. P. aeruginosis, E.coli, yeast, and worm, there is moderate to good correlation between -log 10 (E-value) and the probability of survival P(S) of single-gene KOs.

Figures S3, S4, S5. The influence of ribosomal and WGD genes on single-KO survival rates in yeast.
As ribosomal genes and genes originating from the whole-genome duplication (WGD) have characteristics different to genes of other duplications, we compared survival rates of single-and doublegene knockouts (SKO, DKO) in the whole yeast genome ( Figure S3) with those in the yeast genome without the WGD genes [4] (Figure S4) and those in the yeast genome without predicted ribosomal genes [5] (Figure S5). In each figure, the top (A) and middle panel (B) show the relationship between SKO survival rates and the E-value and effective gene family size, respectively. The bottom panel (C) shows the DKO survival rates in relationship to the effective gene family size. Note that an effective gene family size of D=1 in SKO represents a two-gene family, for example, but in DKO it represents a three-gene family.
The enrichment of two-gene families in yeast in which both genes buffer for each other can in part be explained by the two-gene families originating from the WGD. However, when removing all WGD duplicates, trends are similar to those amongst all genes. Removal of ribosomal genes does not change any of the trends.

Expression analysis
To test for the influence of expression level on buffering by gene duplicates, we tested three different measures: i) experimental expression data; ii) Codon Bias Index (see main text); and iii) Codon Adaptation Index. Figure S6. Chances of survival, fraction of genes with duplicates, and contribution of duplicates to survival in genes of different expression levels Information on gene expression was collected for each organism from different sources (Table S1), filtering for experiments which used conditions (strain, medium) similar to those of the KO/KD screens. The figures shows survival P(S), the fraction of genes with duplicates P(D≥1) and the contribution of duplicates to survival C for the different subsets. * P-value < 0.01; ** P-value < 0.001 Figure S7. Survival P(S) and the effective gene family size D or the E-value between the target gene and its nearest duplicate at different expression levels (yeast)

Experimental expression data
At the example of yeast, we illustrate that within subsets of (highly) expressed genes the correlation between survival P(S) and the effective gene family size D (A) or the E-value between the target gene and its nearest duplicate (B) is similar to that of all genes. This trend is the same for E.coli and worm (not shown). Figure S8. Chances of survival, fraction of genes with duplicates, and contribution of duplicates to survival in genes of high or low CAI

Codon Adaptation Index
To validate the findings from the experimental expression data, we used sequence-based approximations of expression levels, Codon Bias Index (CBI, main text) and Codon Adaptation Index (CAI). Both CBI and CAI were obtained from the CodonW server [6], using standard settings, but adjusting parameters for the respective organism. Calculation of CAI requires a reference data set of expressed genes for calculation of the index, whereas calculation of CBI is purely based on nucleotide sequence. Both measures are expected to work less well in multi-cellular than in single-cellular organisms due to tissue-specific expression levels which cannot be captured by a single sequence feature.
We rank-ordered the values and selected subsets of genes with the highest or lowest CAI, respectively. The sizes of the subsets varied according to the organism's genome size. The figure shows survival P(S), the fraction of genes with duplicates (D≥1), and the contribution of duplicates to buffering C in the top, middle and bottom panel, respectively. Both CBI and CAI show similar trends with respect to survival P(S), the fraction of genes with duplicates (D≥1) and the buffering capacity C as the experimental expression data (main text and Figure S7).

Figure S9. Gene ontology annotation
Function annotation using the Gene Ontology vocabulary was downloaded from the GO website [7] and mapped to a generic GO Slim vocabulary from the same website. GO annotation was only available for yeast, worm, fly and mouse. The figures include functional categories with very few members (<30 genes) which have little statistical power.
A. The probability of survival P(S) for genes of different functions. B. The contribution of duplicates to buffering C for genes of different functions. Survival rates and buffering capacity vary widely across the different functional categories in the four organisms -function is not a correlate of buffering by duplicates. Similar to highly expressed genes, function categories of low P(S) tend to have higher contributions of duplicates to survival and vice versa. No bias for certain functions towards high or low buffering capacity occurs consistently across all organisms.
In yeast, duplicates do not buffer genes of growth and reproduction, but duplicates buffer kinases. In contrast, kinases and other cell cycle proteins in worm are hardly buffered by duplicates, while ribosomal proteins have a contribution of duplicates much higher than average. Due to multiple hypothesis testing, only P-values<0. 01 should be considered significant; none of the functional categories meet this threshold (Z-score across distribution of functional categories).
S -survival upon gene deletion; aa and derivative metab. -amino acid and derivatives metabolism; cellular component org. /biogen. -cellular component organization/biogenesis

Figure S10. Function annotation based on protein domains
Function annotations based on protein domains were obtained from the SUPERFAMILY database [5] using its domain predictions and the domain function annotation. The function annotation scheme consists of seven major categories which map to 50 smaller categories. The annotation was available for all organisms of our analysis. Probability of survival P(S) (top) and contribution of duplicates to buffering C (bottom) shown separately for seven organisms, shown for the major functional categories. Function categories of low P(S) tend to have higher contributions of duplicates to survival and vice versa.

Two-gene families as a model for buffering by duplicates
Previous studies reported that the contribution of duplicates to buffering when examined in double gene-KOs is very small (~2%) [1,2]; however, this contribution is an underestimate for two reasons. First, those studies did not account for different gene family sizes. For example, if both genes of a two-gene family are knocked-out, the phenotype is more likely to be lethal than if two genes of a larger family are knocked-out and additional duplicates are available to buffer for loss of function. Second, two genes, which are unrelated by sequence similarity, can produce a viable double-KO phenotype i) because one or both genes are members of separate gene families in which the duplicates buffer for the KO; ii) because one or both genes are of functions that are not essential under tested conditions. This ambiguity with double-KOs of unrelated genes inflates their chances of survival and makes them less valuable for analysis of buffering by duplicates.
Thus, in our analysis we explicitly distinguish between double-KOs of related and unrelated genes ( Figure S11, middle/right). We also distinguish between different effective family sizes. We find that overall chances of survival of double-KOs are similar to that of single-KOs (P(S)=0. 93 vs. P(S)=0. 82) when focusing on double-KO targets which are sequence-related (Figure S11, middle) and excluding double-KOs of unrelated genes (Figure S11, right). In double-KOs of sequence-related genes ( Figure  S11, middle), overall survival rates are slightly elevated compared to those of single-KOs. This is most likely due to the a priori bias of double-KO experiments towards non-essentials genes -all genes tested by double-K0 screens where non-essential in single-KO screens.
We observe that the yeast genome is enriched for two-gene families whose members are likely to buffer for each other (Figures S11B). In single-KOs (Figure S11B, left), chances of survival are higher amongst two-gene families (D=1) than amongst singletons (D=0) and larger gene families (D≥2). In double-KOs of sequence-related genes ( Figure S11B, middle), yeast two-gene families (D=0) have drastically reduced chances of survival. Two-gene families in yeast are enriched for duplicates originating from the whole-genome duplication (WGD) [4,8], however, the observation from Figure S11B holds true even if members of the WGD are removed (Figure S12). Survival of all 609 two-gene families in our set (P(S|D=1)=0. 51; Figure S11B middle panel) is even lower than survival of the all WGD gene pairs (P(S)=0. 86, from ref. [8]). The enrichment of buffering two-gene families in yeast is also not due to preferential duplication of ribosomal genes (Figure S12), nor do we observe it in worm ( Figure S11C).

Figure S11. Yeast two-gene families are enriched for buffering duplicates
Two-gene families play a special role in yeast. Survival upon single gene-KO is higher in two-gene than in larger families (left), i. e. the two-gene families are enriched for families in which both genes likely buffer for each other. If both genes of a two-gene family are knocked out, chances for survival are low (middle). The trends hold true even when accounting for whole-genome duplicates [8] or ribosomal genes (see Figure S12).
A. When examining survival upon single and double gene-KO in yeast, we distinguish between different buffering scenarios. The cartoons depict genes (circles) and their homologous relationships (lines) as predicted by sequence similarity. Filled circles (black) denote knocked-out genes. Left: single-KO of a singleton, a gene in a two-or three-gene family leaves zero, one or two additional duplicates, respectively, which can buffer for the KO. In the case of double gene knockouts, the two genes can either be sequence-related (i. e. homologous; middle) or unrelated (right). Middle: depending on the family size (two, three, four), after double-KO zero, one or more duplicates remain. Right: if the two double-KO genes are unrelated, zero (one, two) additional paralogs can be achieved if the genes are singletons (members of two-gene or three-gene families). In each group of buffering/KO scenarios (left, middle, right), the number of additional duplicates D is zero, one or more, while the actual family size varies.
B. Chances of survival upon gene-KO in yeast are comparable for single gene-KOs (left) and double gene-KO of sequence-related genes (middle). Survival is generally much higher in double gene-KOs of unrelated genes (right) since those two genes are unlikely to buffer for each other. Two-gene families in yeast are enriched for genes that buffer for each other: chances of survival are higher in single gene-KOs of two-gene families than of larger families. When both genes of a two-gene family are knocked-out, survival chances are low (middle). The red arrows point to the unusual behavior of yeast two-gene families. Numbers printed in the columns report the total number of tested genes (single gene-KO) or tested pairs (double gene-KO).
C. In worm, we observe trends similar to those in yeast (B). Duplicate genes increase chances of survival in single gene-KDs. For double gene-KDs, the situation is less clear, partly due to lower numbers of genes with KD information. In worm, there is no enrichment in buffering two-gene families as observed for yeast. Numbers printed in the columns report the total number of tested genes (single gene-KO) or tested pairs (double gene-KO). Two-gene families play a special role in yeast, as explained in the main text. The families are enriched for duplicates which buffer for each other's loss of function, as can be seen in the comparatively high survival rates of two-gene families (D=1) in single-gene KOs and the low survival rates in of two-gene families (D=0, related genes) in double-KOs (both marked by arrows)(A). The trend holds true even if removing ribosomal genes (B) or genes of the whole-genome duplication (WGD) (C) [4]. Genes of the WGD are enriched for buffering two-gene families [8], but possibly not all WGD pairs have yet been identified. We also observe enrichment in buffering two-gene families in yeast in a set of predicted genetic interactions [2] (not shown).

Yeast
Two-gene families, when targeted for single-and double-gene KO, are a good set for studying characteristics of buffering by gene duplicates (see main text). Large-and small-scale double gene-KO tests identified 50 two-gene families with an SSL phenotype (buffering genes). The characteristics of these families are compared to characteristics of the 559 remaining two-gene families in yeast, and to the characteristics of nine two-gene families with viable phenotypes in double-KOs (Table S3). Table S3B, D also describe the comparison of buffering and non-buffering two-gene families in terms of similarity of vectors describing their interactions. The vectors describe single gene-KO phenotypes [9], function [10], genetic interactions (from the large scale screens described in the main text as well as single SSL interactions listed in BioGRID [11]) and physical interactions (as listed in BioGRID [11]). The table lists several measures of vector similarity, of which the Jaccard index is used in the main text.
Duplicates from the WGD are known to have properties different to those of other duplicates [8,12]. We tested some of the properties listed in Table S3 for all 108 WGD two-gene families in comparison to the 501 two-gene families not identified to originate from the WGD [4] (Table S4). All tested properties are consistent with the findings on buffering in comparison to non-buffering genes. While there is a link between buffering capacity and origin of duplicates in the WGD (reflected in distinct protein characteristics), we cannot resolve causality. We hypothesize, however, that WGD gene pairs are strongly enriched in duplicates that buffer for each other's loss of function in single-gene deletions.

Worm
We extracted the 143 worm two-gene families tested in double-RNAi knockdowns by Tischler et al. [13] which resulted in 16 pairs of synthetic sick or lethal (SSL) phenotypes. We calculated the Codon Adaptation Index for the worm genes using a webserver, http://www.evolvingcode.net/codon/cai/cais.php [14], and found a significant bias similar to that in yeast (see main text), suggesting that buffering genes are more efficiently translated than non-buffering genes. Tables   Table S1. Number of genes in subsets of expressed genes.
The table lists the number of genes in each subset as well as the number of essential genes in the subsets. Information on gene expression was collected for each organism from different sources (Table  S1), filtering for experiments which used conditions (strain, medium) similar to those of the KO/KD screens.
For dual channel microarray experiments, we estimated expression levels based on the spot intensities in the microarrays. In all organisms (except for M. genitalium and mouse), we rank-ordered the quantitative expression levels and selected a subset of genes of the highest and lowest expression levels. Subsets were chosen proportional to dataset size. We tested different cutoffs for subset selection, all with the same qualitative results (not shown). In M. genitalium and mouse only protein identifications but no quantitative data was available, thus we divided the data into sets of 'expressed' (observed) and 'not expressed' (not observed) genes. The set of 'NOT Expressed Genes' is the set of all genes without the expressed genes. 'NOT Expressed Genes' can also be expressed, although at lower levels.
* -expression estimated from spot intensity; GEO -reference [15]; MGD -reference [16]; SMDreference [17] [1], several other large-scale studies have been conducted to-date. We compiled a list of 14 studies marked as 'systematic deletion screen' in GRID [11], which in total describe double-gene KOs of 204 baits against all non-essential yeast genes, resulting in 12,267 SSL interactions.  Hence we also conducted all tests with an extended dataset of all two-gene families in yeast minus the 50 SSL two-gene families, resulting in 559 pairs.
The table lists all properties that we have tested for these sets of two-gene families. All properties were examined either across all genes in the respective set (A, C), or between the genes (B, D). Features from the calculations across genes are calculated between genes as |feature(gene1)-feature(gene2)|. Other features, e.g. sequence similarity, only exist between genes. Due to multiple hypothesis testing (~50 tests), a t-score>3.26 should be considered significant at an adjusted P-value of 0.05. Table E lists the numbers of orthologs and their essentiality (if known) for buffering and non-buffering gene pairs. Orthology is determined by InParanoid [3].
avg. -average  Table S4. Properties of yeast WGD two-gene families and non-WGD two-gene families Duplicates arising from the whole genome duplication (WGD) are different to other duplicates [8]. WGD genes are also enriched in pairs of SSL interaction [8], thus likely to buffer for each other. Conversely, some of the properties of the 'buffering' genes discussed in our paper (see main text and Table S2) may be accounted to the enrichment of WGD genes amongst buffering genes, although the enrichment of WGD in buffering genes is not significant (Table S2). We tested some of the properties listed in Table S2 for all WGD two-gene families in comparison to two-gene families not identified to originate from the WGD. All properties are consistent with the findings on buffering in comparison to non-buffering gene.

Complete yeast genome
Without WGD genes

B.
Without ribosomal genes