Comprehensive genome-wide identification of angiosperm upstream ORFs with peptide sequences conserved in various taxonomic ranges using a novel pipeline, ESUCA

Background Upstream open reading frames (uORFs) in the 5′-untranslated regions (5′-UTRs) of certain eukaryotic mRNAs encode evolutionarily conserved functional peptides, such as cis-acting regulatory peptides that control translation of downstream main ORFs (mORFs). For genome-wide searches for uORFs with conserved peptide sequences (CPuORFs), comparative genomic studies have been conducted, in which uORF sequences were compared between selected species. To increase chances of identifying CPuORFs, we previously developed an approach in which uORF sequences were compared using BLAST between Arabidopsis and any other plant species with available transcript sequence databases. If this approach is applied to multiple plant species belonging to phylogenetically distant clades, it is expected to further comprehensively identify CPuORFs conserved in various plant lineages, including those conserved among relatively small taxonomic groups. Results To efficiently compare uORF sequences among many species and efficiently identify CPuORFs conserved in various taxonomic lineages, we developed a novel pipeline, ESUCA. We applied ESUCA to the genomes of five angiosperm species, which belong to phylogenetically distant clades, and selected CPuORFs conserved among at least three different orders. Through these analyses, we identified 88 novel CPuORF families. As expected, ESUCA analysis of each of the five angiosperm genomes identified many CPuORFs that were not identified from ESUCA analyses of the other four species. However, unexpectedly, these CPuORFs include those conserved in wide taxonomic ranges, indicating that the approach used here is useful not only for comprehensive identification of narrowly conserved CPuORFs but also for that of widely conserved CPuORFs. Examination of the effects of 11 selected CPuORFs on mORF translation revealed that CPuORFs conserved only in relatively narrow taxonomic ranges can have sequence-dependent regulatory effects, suggesting that most of the identified CPuORFs are conserved because of functional constraints of their encoded peptides. Conclusions This study demonstrates that ESUCA is capable of efficiently identifying CPuORFs likely to be conserved because of the functional importance of their encoded peptides. Furthermore, our data show that the approach in which uORF sequences from multiple species are compared with those of many other species, using ESUCA, is highly effective in comprehensively identifying CPuORFs conserved in various taxonomic ranges.

acid sequences. For each candidate CPuORF, a representative uORF-tBLASTn and mORF-tBLASTn hit is 160 selected from each order, and the putative uORF sequences in the representative uORF-tBLASTn and 161 mORF-tBLASTn hits are used for the calculation of the K a /K s ratio. If the K a /K s ratio of a candidate CPuORF is 162 less than 0.5 and significantly different from that of the negative control with q less than 0.05, then the candidate 163 CPuORF is selected for further analysis. The final step is to determine the taxonomic range of uORF sequence 164 conservation. In this step, the representative uORF-tBLASTn and mORF-tBLASTn hits selected in the fifth step 165 are classified into taxonomic categories (Fig. 4). On the basis of the presence of the uORF-tBLASTn and 166 mORF-tBLASTn hits in each taxonomic category, the taxonomic range of sequence conservation is determined 167 for each CPuORF. 168 169 Identification of angiosperm CPuORFs using ESUCA 170 details). In the fourth step, the uORF-tBLASTn hits were subjected to mORF-tBLASTn analysis, and 188 uORF-tBLASTn and mORF-tBLASTn hits were extracted. Plant EST and TSA databases can include 189 contaminant sequences from other organisms, such as parasites, plant-feeding insects and infectious 190 microorganisms. We checked the possibility that the extracted uORF-tBLASTn and mORF-tBLASTn hits 191 included contaminant EST/TSA sequences, using BLASTn searches. The BLASTn searches were performed 192 using each uORF-tBLASTn and mORF-tBLASTn hit EST/TSA sequence as a query against EST/TSA and 193 RefSeq RNA sequences from all organisms, with an E-value cutoff of 10 -100 and an identity threshold of 95%. 194 Contaminant EST/TSA sequences were identified by this analysis, as described in Materials and Methods,and 195 were removed from the uORF-tBLASTn and mORF-tBLASTn hits. We selected uORFs whose remaining 196 uORF-tBLASTn and mORF-tBLASTn hits were found in homologs from at least two orders other than that of 197 the original uORF. Thereafter, we generated amino acid sequence alignments of each selected uORF and its 198 homolog, using a putative homologous uORF sequence from each order in which uORF-tBLASTn and 199 mORF-tBLASTn hits were found. When multiple original uORFs derived from splice variants of the same gene 200 partially or completely shared amino acid sequences, the one with the longest conserved region was manually 201 selected on the basis of the uORF amino acid sequence alignments. In the fifth step, the remaining uORFs were 202 subjected to K a /K s analysis. The uORFs with K a /K s ratios less than 0.5 showing significant differences from those 203 of negative controls (q < 0.05) were selected as candidate CPuORFs (Supplementary Table S1 Table S1). The amino acid sequences of the remaining candidate CPuORFs are not similar to 208 those of the known CPuORFs. Therefore, 17, 6, 12, 71, and 35 novel candidate CPuORFs were extracted from 209 Arabidopsis, rice, tomato, poplar, and grape genomes, respectively. 210 211

Validation of candidate CPuORFs 212
If the amino acid sequence of a uORF is evolutionarily conserved because of functional constraints of the 213 uORF-encoded peptide, it is expected that the amino acid sequence in the functionally important region of the amino acid sequences in the same region are conserved among uORF sequences in the alignment of each novel 216 candidate CPuORF. We found that the alignments of 13 novel candidate CPuORFs contain sequences that do not 217 share the consensus amino acid sequence in the conserved region, and removed these sequences from the 218 alignments. We also removed sequences derived from genes not related to the corresponding original 219 uORF-containing gene from the alignments of five novel candidate CPuORFs. When these changes resulted in 220 the number of orders with the uORF-tBLASTn and mORF-tBLASTn hits becoming less than two, the candidate 221 CPuORFs were discarded. Eight novel candidate CPuORFs were discarded for this reason. Supplementary 222 Figure S1 shows the uORF amino acid sequence alignments without the removed sequences. The K a /K s ratios 223 were recalculated after the manual removal of the sequences (Supplementary Table S1), and six additional novel 224 candidate CPuORFs were discarded because their K a /K s ratios were greater than 0.5.  Protein sequences with an N-terminal region similar to the amino acid sequence encoded by the 5′-extended region 231 of the mORF in this splice variant are found in most orders from which the uORF-tBALASTn and 232 mORF-tBLASTn hits of this candidate CPuORF were extracted, suggesting that the splice variant with the 233 5′-extended mORF is not a minor form among orthologous transcripts. Therefore, this candidate CPuORF was 234

discarded. 235
In the second step of ESUCA, we excluded uORF sequences likely to encode parts of the mORF-encoded 236 proteins, by removing uORFs with high uORF-mORF fusion ratios. To confirm that the novel candidate 237 CPuORFs do not code for parts of the mORF-encoded proteins, each of the putative uORF sequences used for 238 the alignment and K a /K s analysis was queried against the UniProt protein database (https://www.uniprot.org/), 239 using BLASTx. When putative uORF sequences matched protein sequences with low E-values, we manually 240 checked whether amino acid sequences similar to those encoded by the putative uORFs were contained within ortholog, POPTR_0019s07850, were identified in many orders. This suggests that the sequences encoded by 244 these candidate CPuORFs are likely to function as parts of the mORF-encoded proteins. Therefore, we discarded 245 these candidate CPuORFs. For some other novel candidate CPuORFs, mORF-encoded proteins with sequences 246 similar to those encoded by the candidate CPuORF and/or its homologous putative uORFs were also found. 247 However, we did not exclude these candidate CPuORFs, because such uORF-mORF fusion type proteins were 248 found in only a few species for each candidate. 249 After manual validation, 12, 4, 11, 70, and 34 uORFs were identified as novel CPuORFs in Arabidopsis, rice, 250 tomato, poplar and grape, respectively. Among these novel CPuORFs, those with both similar uORF and mORF 251 amino acid sequences were classified into the same HGs. Also, using OrthoFinder ver. 1.1.4 [32], an algorithm 252 for ortholog group inference, we classified the genes with novel CPuORFs and those with previously identified 253 CPuORFs into ortholog groups. The same HG number with a different sub-number was assigned to CPuORFs of 254 genes in the same ortholog group with dissimilar uORF sequences (e.g. HG56.1 and HG56.2). Of the newly 255 identified CPuORF genes, six were classified into the same ortholog groups as previously identified CPuORF 256 genes, but the amino acid sequences of these six CPuORFs are dissimilar to those of the known CPuORFs. 257 Including this type of CPuORFs, we identified 131 novel CPuORFs that belong to 88 novel HGs (HG2.2, HG9.2, 258 HG16.2, HG43.2, HG50.2, HG52.2 and HG54-HG130) (Supplementary Table S1).  Table S2K). To focus on the sequence-dependent effect of the CPuORF on mORF translation, 284 the upstream uORF was eliminated by mutating its initiation codon because the presence of the immediate 285 upstream uORF may reduce the translation efficiency of the CPuORF and therefore potentially make the effect of 286 the CPuORF ambiguous. The 5´-UTR sequences containing the selected CPuORFs were fused to the firefly 287 luciferase (Fluc) coding sequence and were placed under the control of the 35S promoter to generate the 288 wild-type (WT) reporter constructs (Fig. 6a, Supplementary Figure S2). To assess the importance of the amino 289 acid sequences for the effects of these CPuORFs on mORF translation, frameshift mutations were introduced into 290 the CPuORFs so that the amino acid sequences of their conserved regions could be altered. A + 1 or − 1 291 frameshift was introduced upstream or within the conserved region of each CPuORF, and another frameshift was 292 introduced before the stop codon to shift the reading frame back to the original frame (Supplementary Figure S2). 293 These reporter constructs were each transfected into protoplasts from Arabidopsis thaliana MM2d 294 suspension-cultured cells. After 24 h of incubation, cells were harvested and disrupted to analyze luciferase 295 activity. In five of the 11 CPuORFs, the introduced frameshift mutations significantly upregulated the expression 296 of the reporter gene, indicating that these CPuORFs repress mORF translation in a sequence-dependent manner of the CPuORFs conserved only among rosids. Therefore, this result suggests that CPuORFs conserved only 299 among rosids can have sequence-dependent regulatory effects. 300

Comprehensive identification of angiosperm CPuORFs by ESUCA analyses of multiple species' genomes 303
In this study, we developed ESUCA, a pipeline for efficient genome-wide identification of CPuORFs. By 304 applying ESUCA to five angiosperm genomes, we identified 131 CPuORFs that belong to 88 novel HGs. Of 305 these HGs, 70 were identified through ESUCA analysis of only one of the Arabidopsis, rice, tomato, poplar or 306 grape genomes (Fig. 5). This means that CPuORFs belonging to these HGs cannot be identified by comparing 307 uORF sequences of orthologous genes between these five species. Therefore, this result demonstrates that the 308 approach used in this study, in which uORF sequences from multiple species are compared with those of many 309 other species, is highly effective in comprehensively identifying CPuORFs. We expected this approach to be 310 particularly useful for comprehensive identification of CPuORFs conserved in relatively narrow taxonomic 311 ranges. Therefore, we used the five angiosperm species belonging to relatively distant lineages in order to identify 312 CPuORFs conserved only among the taxonomic categories to which each of the five species belong. However, 313 unexpectedly, in 41 of 70 HGs identified through ESUCA analysis of only one of the five plant genomes, 314 CPuORF sequences are conserved in both rosids and asterids, two major groups of eudicots, indicating that these 315 CPuORFs are conserved across diverse eudicots (Fig. 5). One possible explanation of this observation is that 316 sequence conservation of uORFs is often lost in small taxonomic groups, such as orders or families, during 317 evolution. For example, while the CPuORF of the tomato LOC101264451 gene, which belongs to HG43.1, 318 exerts a sequence-dependent repressive effect on mORF translation, the CPuORF of its Arabidopsis ortholog, 319 ANAC096, lacks the C-terminal half of the amino acid sequence in the highly conserved region and does not have 320 a sequence-dependent regulatory effect [25][26][27]. In contrast to the Arabidopsis ANAC096 CPuORF, all the critical 321 amino acid residues in the highly conserved region are retained in the HG43.1 CPuORF of Tarenaya hassleriana, 322 which belongs to the same order as Arabidopsis, Brassicales, but a different family. CPuORFs involved in 323 preferable but not essential post-transcriptional regulations may be lost in some taxonomic clades during 324 evolution. Such CPuORFs would be difficult to be identified by comparing uORF sequences between a few selected species if a selected species is included in clades where the CPuORF sequences were lost. Therefore, the 326 ESUCA method is advantageous for comprehensive identification of CPuORFs compared with conventional 327 comparative genomic approaches. Our results reveal that the approach used here (i.e. ESUCA analyses of 328 multiple species' genomes) is further advantageous and is highly useful not only for comprehensive identification 329 of narrowly conserved CPuORFs but also for that of widely conserved CPuORFs. 330 The transient expression assays in the current study identified five novel poplar regulatory CPuORFs that 331 exert sequence-dependent repressive effects on mORF translation. One of the identified regulatory CPuORFs is 332 CPuORFs are conserved. Our results demonstrate that this function of ESUCA is highly useful for the efficient belonging to HG55 and HG66 showed no significant sequence-dependent effect on mORF translation, despite 354 their widespread sequence conservation beyond eudicots (Figs. 5 and 6b). These CPuORFs might encode 355 peptides that have functions other than the control of mORF translation, or they might exert sequence-dependent 356 regulatory effects only under certain conditions. In fact, many known sequence-dependent regulatory uORFs 357 repress mORF translation in response to metabolites, such as polyamine, arginine, and sucrose [10,14,15,17]. 358 Likewise, the other CPuORFs that exhibited no significant sequence-dependent regulatory effect might encode 359 peptides that have other functions or exert regulatory effects only under specific conditions. 360 361

Filtering using the uORF-mORF fusion ratio 362
To distinguish between 'spurious' CPuORFs conserved because they code for parts of mORF-encoded proteins 363 and 'true' CPuORFs conserved because of functional constraints of their encoded small peptides, we employed 364 the criterion of the uORF-mORF fusion ratio and discarded uORFs with uORF-mORF fusion ratios equal to or 365 greater than 0.3. We checked how effectively the 'spurious' CPuORFs were removed with this criterion, using a 366 protein sequence database. Although uORF-mORF fusion type protein sequences were found in OsUAM2 367 homologs from widespread angiosperm species, no other candidate CPuORFs were extracted in which 368 uORF-mORF fusion type protein sequences were found in many species that have sequences similar to the 369 candidate CPuORFs. This indicates that the uORF-mORF fusion ratio filtering worked effectively to exclude 370 'spurious' CPuORFs that code for parts of the mORF-encoded proteins. 371 ESUCA extracted all known plant cis-acting regulatory CPuORFs that control mORF translation in a 372 sequence-dependent manner (i. e. HG1, HG3, HG6, HG12, HG13, HG14, HG15.3, HG18, HG19, HG27 Table S1). In this study, 11 CPuORFs belonging to this HG were extracted using ESUCA and the uORF-mORF 377 fusion ratios of these CPuORFs were in the range of 0.17 to 0.27, with a median value of 0.26 (Supplementary species. The uORF-mORF fusion ratios of the candidate CPuORFs of rice OsUAM2 and its poplar ortholog were 381 0.28 (Supplementary Table S1). The rice OsUAM2 gene codes for a protein similar to UDP-arabinopyranose 382 mutases (EC 5.4.99.30) [40]. Two protein isoforms, a uORF-mORF fusion type and one lacking the 383 uORF-encoded region, produced from splice variants of an OsUAM2 orthologous gene, are found in diverse 384 angiosperm species. The functions of both of isoforms are not yet known. Considering the example of HG3 385 CPuORFs, we cannot rule out the possibility that a candidate CPuORF functions as a regulatory uORF even if 386 uORF-mORF fusion type protein sequences are found in widespread species. However, to avoid including 387 potential 'spurious' CPuORFs whose amino acid sequences are likely to be evolutionarily conserved because of 388 their function as N-terminal regions of the mORF-encoded proteins, we excluded the candidate CPuORFs of the 389 rice OsUAM2 gene and its poplar ortholog. In conclusion, the criterion of the uORF-mORF fusion ratio used in 390 this study appears appropriate because all known cis-acting sequence-dependent regulatory CPuORFs were 391 extracted and most 'spurious' CPuORFs were removed. 392 393

394
The present study demonstrates that the approach in which uORF sequences from multiple species are compared 395 with those of many other species, using ESUCA, is highly effective in comprehensive identification of CPuORFs. 396 Using this approach, we identified many novel angiosperm CPuORFs, which include CPuORFs conserved 397 among limited clades and widely conserved CPuORFs. Our results also showed that ESUCA is capable of 398 efficiently selecting CPuORFs likely to be conserved because of functional importance of their encoded peptides. 399 The approach used here can be applied to any eukaryotic organism with available genome and transcript 400 sequence databases and therefore is expected to contribute to the comprehensive identification of CPuORFs 401 encoding functional peptides in various organisms. Furthermore, besides CPuORFs, the algorithms developed for 402 ESUCA and the approach used here can be applied to the identification of other sequences conserved in various 403 taxonomic ranges. 404 acid sequence of the mORF associated with the original uORF was used as a query sequence, and transcript 465 sequences matching the mORF with an E-value less than 10 -1 were extracted. For each of the uORF-tBLASTn 466 and mORF-tBLASTn hits, the upstream in-frame stop codon closest to the 5'-end of the region matching the 467 original mORF was selected, and the 5'-most in-frame ATG codon located downstream of the selected stop 468 codon was identified as the putative mORF initiation codon (Fig. 3). uORF-tBLASTn and mORF-tBLASTn hits 469 were discarded as uORF-mORF fusion type sequences if the putative mORF overlapped with the putative uORF. 470 The original uORF was selected as a candidate CPuORF if the remaining uORF-tBLASTn and 471 mORF-tBLASTn hits belonged to at least two orders other than that from which the original uORF was derived. 472 473

Identification of contaminant ESTs and TSAs 474
In the uORF-tBLASTn and mORF-tBLASTn analyses described above, we excluded tBLASTn hit ESTs and 475

K a /K s analysis 493
For K a /K s analysis of each candidate CPuORF, a putative uORF sequence was selected from each order in which 494 uORF-tBLASTn and mORF-tBLASTn hits were found, using the following criteria. First, we selected transcript 495 sequences that matched the mORF associated with the candidate CPuORF with an E-value less than 10 -20 in 496 mORF-tBLASTn analysis. Of these sequences, we then selected the one with the smallest geometric means of 497 mORF-tBLASTn and uORF-tBLASTn E-values. When mORF-tBLASTn E-values of all uORF-tBLASTn and 498 mORF-tBLASTn hits in an order were equal to or greater than 10 -20 , we selected the transcript sequence with the 499 smallest geometric means of mORF-tBLASTn and uORF-tBLASTn E-values in the order. Putative uORF 500 sequences in the selected transcript sequences were used for generation of uORF amino acid sequence 501 alignments and for K a /K s analysis. Multiple alignments were generated by using standalone Clustal Omega 502 For statistical tests of K a /K s ratios, we calculated the distribution of mutation rates between the original uORF 506 and its homologous putative uORFs and those between the original uORF and its artificially generated mutants, 507 using the observed mutation rate distribution. Then, observed empirical K a /K s ratio distributions were compared 508 with null distributions (negative controls) using the Mann-Whitney U test to validate statistical significance. The 509 one-sided U test was used to investigate whether the observed distributions were significantly lower than the null 510 distributions. Adjustment for multiple comparisons was achieved by controlling the false discovery rate using the 511 Benjamini and Hochberg procedure [47]. 512 513

Determination of the taxonomic range of uORF sequence conservation 514
To automatically determine the taxonomic range of the sequence conservation of each CPuORF, we first defined 515 13 plant taxonomic categories. The 13 defined taxonomic categories are lamiids, asterids other than lamiids, 516 mavids, fabids, eudicots other than rosids and asterids, commelinids, monocots other than commelinids, Taxonomy, the uORF-tBLASTn and mORF-tBLASTn hit sequences selected for K a /K s analysis were classified 521 into the 13 taxonomic categories (Fig. 4). It should be noted that, in NCBI taxonomy, eudicots, Angiospermae, 522 and Ggymnospermae are referred to as eudicotyledons, Magnoliophyta, and Acrogymnospermae, respectively. 523 For each CPuORF, the numbers of transcript sequences classified into each category were counted and shown in 524 Supplementary   Step 1: Extraction of uORF sequences from a transcript sequence dataset uORF-mORF fusion ratio ≥ 0.3 uORF-mORF fusion ratio < 0.3 Step 2: Calculation of uORF-mORF fusion ratios uORFs with no tBLASTn hits from the other orders uORFs with tBLASTn hits from the other orders Step 3: Homology searches of the uORF amino acid sequences against a transcript sequence database (uORF-tBLASTn) uORFs conserved among homologs from more than two orders Step 4: Selection of uORFs conserved among homologous genes (mORF-tBLASTn) uORFs not conserved among homologs from more than two orders Discarded K a /K s ratio ≥ 0.5 or q ≥ 0.05 K a /K s ratio < 0.5 and q < 0.05 Step 5: Calculation of K a /K s ratios Step 6: Determination of the taxonomic range of uORF sequence conservation