To what extent does gene connectivity within co-expression networks matter for phenotype prediction?

Recent literature on the differential role of genes within networks distinguishes core from peripheral genes. While previous work has shown contrasting features between them, whether such a categorization matters for phenotype prediction remains to be studied. We sequenced RNA in a Populus nigra collection and built co-expression networks to define core and peripheral genes. We found that cores were more differentiated between populations than peripherals while being less variable, suggesting that they have been constrained through potentially divergent selection. We also showed that while cores were overrepresented in a subset of genes deemed important for trait prediction, they did not systematically predict better than peripherals or even random genes. Our work is the first attempt to assess the importance of co-expression network connectivity in phenotype prediction. While highly connected core genes appear to be important, they do not carry enough information to systematically predict quantitative traits better than other gene sets.

∗Equal contribution †Corresponding author: vincent.segura@inra.fr

[...] than core genes and consequently, they harbor larger amounts of variation at population levels. Furthermore, classic studies of molecular evolu- [...]

From these populations, genotypes were collected and planted in 2 locations (Orléans, in central France, and Savigliano, in northern Italy). At each site, we planted 6 clones of each genotype, 1 in each of the 6 blocks, and their position in each block was randomized. In all blocks, we collected phenotypes: 10 in Orléans (circumference, S/G, glucose, C5/C6, extractives, lignin, H/G, diameter, infradensity and date of bud flush) and 7 in Savigliano (circumference, S/G, glucose, C5/C6, extractives, lignin, H/G). RNA sequencing and data processing were performed only on the clones of 2 blocks in Orléans. The processed RNAseq data were used with different algorithms and in different sets to predict the phenotypes measured on the same trees (in Orléans) or on the same genotype but on different trees (in Savigliano).
[...] were presumably involved in the experiment, to look [...] gene expression network is to use the weighted [...] underlines the usefulness of kME as a centrality score to further characterize the genes within each module. We thus used this centrality score to further define the topological position of our gene expressions in the network and to serve as a basis for role comparisons between genes. For each gene, we used its highest absolute score, which corresponds to its score within the module to which it was assigned. We selected the 10% of genes with the highest global absolute scores to define the core genes group, and 10% with the [...]

[...] (Table S4). While the grey module is typically discarded in classic clustering studies, we chose to maintain it and rather understand its composition and role, by adding to the comparative study two peripheral sets, one with and one without grey module genes (subsequently called "peripheral NG", NG for "no grey").
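The kME-based definition of core and peripheral genes can be illustrated with a minimal sketch. It assumes a hypothetical matrix `kme` of signed kME values (genes in rows, modules in columns) and, because the text is truncated at that point, it also assumes that the peripheral set is the 10% of genes with the lowest absolute scores. The function name `classify_genes` and the simulated data are illustrative only and are not taken from the original analysis.

```python
import numpy as np

def classify_genes(kme, fraction=0.10):
    """Split genes into core and peripheral sets from a kME matrix.

    kme: array of shape (n_genes, n_modules) holding signed kME values.
    fraction: share of genes placed in each set (10% in the text).
    ASSUMPTION: peripheral genes are those with the lowest scores.
    """
    kme = np.asarray(kme, dtype=float)
    # Highest absolute kME per gene, i.e. its score in its own module.
    score = np.abs(kme).max(axis=1)
    # Keep at least one gene per set so that slicing stays well defined.
    n_select = max(1, int(round(fraction * score.size)))
    order = np.argsort(score)               # ascending order of scores
    peripheral = np.sort(order[:n_select])  # lowest scores (assumed)
    core = np.sort(order[-n_select:])       # highest scores
    return score, core, peripheral

# Simulated example: 200 genes, 5 modules (hypothetical data).
rng = np.random.default_rng(0)
kme = rng.uniform(-1.0, 1.0, size=(200, 5))
score, core, peripheral = classify_genes(kme)
```

The function returns gene indices rather than names, so the same indices can be used to subset an expression matrix when building the prediction sets.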

To assess the robustness of WGCNA analysis results, we compared them to a k-means clustering (R [...]

The key difference, however, is that cores were not the only contributors to the Boruta sets. It seems that cores are able to summarize key information for quality predictions but require a complementary contribution from other interacting genes to round out the optimal set. This is better reflected by the performance of the Boruta set, which obtained the best performance predicting traits under the NN algorithm.

To some extent, the NN algorithm exploits the inter- [...]

The information that they contain has to be completed by other genes. The mean connectivity score (kME) of the Boruta sets is around 0.7. However, as genes seem to be very interactive, predicting a phe- [...] (Table S3).
Finally, we computed for each gene expression the coefficient of genetic variation ($CV_g$) by dividing its total genetic variance ($\sigma^2_b + \sigma^2_w$) by its expression mean.
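As a minimal sketch of this computation, the snippet below follows the definition exactly as stated in the text: the sum of the two genetic variance components divided by the mean expression. The function name `genetic_cv`, its argument names and the numeric inputs are hypothetical, and the variance components are assumed to have been estimated beforehand.

```python
import numpy as np

def genetic_cv(var_between, var_within, mean_expression):
    """Coefficient of genetic variation as defined in the text.

    var_between: between-component genetic variance per gene.
    var_within: within-component genetic variance per gene.
    mean_expression: mean expression per gene (must be non-zero).
    """
    var_between = np.asarray(var_between, dtype=float)
    var_within = np.asarray(var_within, dtype=float)
    mean_expression = np.asarray(mean_expression, dtype=float)
    if np.any(mean_expression == 0):
        raise ValueError("mean expression must be non-zero")
    # Total genetic variance divided by the expression mean.
    return (var_between + var_within) / mean_expression

# Two hypothetical genes with made-up variance components and means.
cv = genetic_cv([0.4, 0.1], [0.6, 0.1], [10.0, 4.0])
```

The division is vectorized, so one call can process all genes at once.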

Other population statistics

We further used a previously developed bioinformat- [...]

[...]             3   20
darkgrey        37   1   21
saddlebrown     58   0    1
violet          53   0    0
white           27   0   23
darkmagenta     39   3    7
lightyellow     33   4   10
orange          45   0    0
darkorange      43   0    1
darkred         43   0    0
royalblue       37   0    5
green           25   0   14
lightgreen      37   0    0
paleturquoise   24   1   12
skyblue         32   1    4
tan             27   1    9
darkgreen       22   1   10
darkolivegreen  25   1    3
midnightblue    18   1   10
steelblue       25   1    3
yellowgreen     22   1    2
sienna3         19   2    2
skyblue3        19   0    0

Figure S1: PCA of the different cofactors (xylem and cambium scraper, extractor and extraction method, population, sequencing column, line and plate, the growth rate at harvest, sampling date, time, temperature, solar radiation, humidity and wind speed). Each panel shows the distribution of the individuals on the first 2 axes of the PCA (representing 17.7% of the variation), colored by class. Cofactors related to weather are presented in the 6 lower plots.

Figure S3: Heatmap of module-trait Spearman's correlations, on a dark blue (high negative correlation) to light yellow (high positive correlation) scale. We removed correlations with a p-value higher than 5% after Bonferroni correction. From the total of 425 correlations, 72 remained.

Figure S4: Relationship between Spearman's correlations between module-trait (y-axis) and gene significance-kME (x-axis).

[...] between LM and NN prediction scores for the core (in blue), random (in grey), peripheral (in brown), peripheral NG (in orange) and Boruta gene sets (in green). (B) The LM differences are in red and the NN differences in turquoise, and the color filling the bar represents the difference between core and peripheral genes in brown, between core and peripheral NG in orange, and between the random sets in grey. For the random pairs, error bars represent the first and third quartiles of the differences between pairs of randomized sets and the bar corresponds to the median.

Supplemental figures