Glycosylation mediated by glycosyltransferase enzymes (GTs) is a critical step in metabolic pathways with diverse roles in cellular processes and homeostasis. Recent studies involving functional characterization of plant GTs suggest that they play important roles in growth, development and interaction with the environment. The activities of many GTs from a variety of plants, and the biological roles of their products, have been known for a long time. However, methods for identifying UGTs based on biochemical and classical genetic approaches are slow and difficult. Recent developments in plant genomics stimulated the use of strategies such as differential display methods and/or homology-based screening of cDNA libraries for the identification and isolation of novel UGT genes [24–26], although the roles of many UGTs remain uncertain. The availability of whole-genome sequences for many plants has enabled thorough and detailed analysis of multigene families. For example, in Arabidopsis, a genome-wide search using the PSPG motif identified 120 putative UGT genes. Similarly, a whole-genome survey of six plant species identified between 56 (Carica papaya) and 242 (Glycine max) UGTs.
The recently published draft genome sequence and the extensive tissue-specific EST library collections of flax provided an opportunity to investigate the diversity of the flax UGT multigene family in greater detail. We identified 137 flax UGTs, which is more than the number identified in Arabidopsis but fewer than the numbers discovered in rice, grapevine and Medicago. All the identified UGTs contain two major domains, a conserved C-terminal domain and a variable N-terminal domain, although the overall sequence diversity among the genes was high.
Flax UGT family resembles the phylogenetic group structure of Arabidopsis UGTs
A phylogenetic tree provides a framework to compare the properties of gene family members and to identify similarities and differences among them. In the present study, the flax genome revealed 22 UGT families, including four new families (94, 97, 709 and 712) not reported in Arabidopsis. However, phylogenetic analysis clustered the flax UGTs into the same 14 groups (A-N) reported in Arabidopsis [7, 12]; interestingly, the four new flax UGT families did not form any additional groups. Moreover, all six sequences of the UGT94 family clustered with the Sesamum indicum UGT94D1 sequence (BAF99027); UGT94D1 and UGT94B1 (AB190262) are the only UGT94 family sequences reported to date. A phylogenetic tree constructed by Bowles et al., using 22 UGT sequences reported from other plant species along with the Arabidopsis UGT sequences, mostly resolved into the same 14 groups, while an additional group of cytokinin GTs was identified containing the Phaseolus vulgaris and Zea mays UGT sequences [31, 32]. Based on the phylogenetic analysis of Arabidopsis UGTs, it has been shown that the regiospecificity of glycosylation can, to a large extent, be correlated with the phylogenetic groups. Exceptions might be due to regioswitching events during evolution. In some cases, phylogenetically closely related UGTs show distinct regiospecific differences towards a common acceptor. For example, the A. thaliana UGTs AtUGT74F1 and AtUGT74F2 share ~82% amino acid sequence identity; while AtUGT74F1 glucosylates the phenolic hydroxyl group of 2-hydroxybenzoic acid, AtUGT74F2 glucosylates both the carboxyl and hydroxyl groups of 2-hydroxybenzoic acid. Conversely, in some cases (e.g. UGT85B1), the genes have been shown to exhibit broad specificity toward acceptors in vitro; however, a member of this group (UGT85Q1) in Sorghum bicolor specifically catalyzes the conversion of p-hydroxymandelonitrile into dhurrin in vivo.
This analysis, along with the amino acid sequence similarity of UGT families within a group, might be useful for predicting substrates [31, 36]. For example, Osmani et al. reported that the group G members glycosylate terpenoids, while the members of groups D, E and L glycosylate flavonoids, terpenoids and benzoates.
However, a study of several Medicago truncatula UGTs highlighted the difficulties in assigning substrate specificity based on phylogeny. Biochemical and phylogenetic studies of MtUGT78G1 and MtUGT85H2 showed that substrate specificity could not be predicted from their clustering with biochemically characterized UGTs belonging to the same family. Although a few genomes, such as those of rice, poplar, grapevine and Medicago, have been screened and annotated for GT genes, these genes have not so far been assigned to GT groups and families. Apart from the model plant Arabidopsis, this is the first attempt to classify the GT genes of a crop plant, flax, into groups and families according to the standardized system recommended by the UGT Nomenclature Committee. Thus, the present analysis of flax UGT genes might help to narrow down the substrate choice of a specific gene.
Detection of orthologs and functional divergence of unique flax UGTs
Detection of orthologs is critically important for accurate functional annotation and has been widely used to facilitate studies in comparative and evolutionary genomics. Several methods, such as BlastP, inparanoid and reciprocal smallest distance, have been reported for detecting orthologs. In the present study, we used BlastP to identify orthologs of the flax UGTs in four sequenced dicots (Ricinus communis, Populus trichocarpa, Vitis vinifera and Arabidopsis thaliana). Of the 137 flax UGTs, 130 had orthologs in the four dicots, and seven flax-diverged UGTs were detected. Based on the microarray and EST data, 95 of these 130 orthologs (73%) showed expression evidence, while five of the seven flax-diverged UGTs showed expression evidence, suggesting their functional divergence. Thus, the flax-diverged UGTs, whose primary sequences differ significantly from those of the other surveyed dicots, might have evolved independently since the last common ancestor of flax and these dicots. As the number of flax-diverged UGTs identified in our analysis is small, other methods such as an inparanoid search need to be conducted to identify flax-diverged UGTs that the present analysis might have missed. However, we could not perform this analysis, as the flax scaffold sequences are not yet publicly available for conducting the inparanoid search.
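One common way to call orthologs from BlastP results is the reciprocal best hit criterion. The text does not state which criterion was applied here, so the following Python sketch is only an illustration under that assumption. The gene identifiers, bit scores and the function names `best_hit` and `classify` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each query maps to a list of (subject_id, bit_score) BLAST hits.
# All identifiers and scores below are invented for illustration.
Hits = Dict[str, List[Tuple[str, float]]]


def best_hit(hits: Hits, query: str) -> Optional[str]:
    """Return the subject with the highest bit score, or None if no hits."""
    candidates = hits.get(query, [])
    if not candidates:
        return None
    return max(candidates, key=lambda hit: hit[1])[0]


def classify(flax_ids: List[str], flax_vs_dicot: Hits, dicot_vs_flax: Hits):
    """Split flax genes into reciprocal-best-hit orthologs and the rest.

    Assumption: a gene counts as having an ortholog only if its best dicot
    hit points back to it; everything else is labelled 'diverged'.
    """
    orthologs: Dict[str, str] = {}
    diverged: List[str] = []
    for gene in flax_ids:
        partner = best_hit(flax_vs_dicot, gene)
        if partner is not None and best_hit(dicot_vs_flax, partner) == gene:
            orthologs[gene] = partner
        else:
            diverged.append(gene)
    return orthologs, diverged


flax_ids = ["LuUGT_a", "LuUGT_b", "LuUGT_c"]
flax_vs_dicot = {
    "LuUGT_a": [("At_1", 510.0), ("Vv_7", 420.0)],
    "LuUGT_b": [("Pt_3", 95.0)],
    # LuUGT_c has no hit at all in the surveyed dicots
}
dicot_vs_flax = {
    "At_1": [("LuUGT_a", 505.0)],
    "Pt_3": [("LuUGT_x", 300.0), ("LuUGT_b", 90.0)],
}

if __name__ == "__main__":
    print(classify(flax_ids, flax_vs_dicot, dicot_vs_flax))
```

In this toy input, only the gene whose best dicot hit points back to it is kept as an ortholog; a gene with a one-directional hit and a gene with no hit are both reported as diverged. A real analysis would also apply an E-value or coverage cutoff, which is omitted here.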
Intron mapping to understand the evolution of UGT family
Introns, and more specifically their position, phase, loss and gain, can serve as an important tool for understanding the evolution of a gene family within phylogenetic groups. Therefore, we conducted intron mapping in the 137 flax UGTs, of which 40.14% were intronless. This percentage is lower than that observed in Arabidopsis, in which >50% of the genes were intronless. In flax UGTs, a total of seven intron positions were identified, with the number of introns per family ranging from one to four. Most families showed the presence of the conserved introns 3 (53.65%) and 4 (32.92%), which could probably be considered the oldest of the seven introns identified. Intron 3 was present in almost all members of groups F-J and N, while intron 4 was dominant in groups L and K. Interestingly, in these groups, wherever intron 3 was present, intron 4 was absent and vice versa, except in LuUGT709E3, where both introns were present, and in LuUGT87J2, where both were absent. Among the other groups, introns 3 and 4 were absent in some members of groups A, D, M and E. This suggests that either of these introns was gained prior to the diversification of flax UGTs. This is also supported by the observation that most of the conserved introns were in the same phase.
It is a commonly held view that the majority of conserved introns are ancient elements and that their phases usually remain unchanged. In fact, it has further been suggested that intron sliding, i.e. a shift of the intron-exon boundary over a few nucleotides that changes the intron phase, is a rare event, and that introns retain their phase over long evolutionary time. Furthermore, introns other than the conserved introns were found only within a single restricted group of closely related sequences or in only a single gene, suggesting a general pattern of intron gain during evolution of the flax UGT gene family. A clear case of loss of a conserved intron and gain of intron 5 was seen in the subfamily of closely related genes LuUGTB17, LuUGTB19 from group A. Similarly, in LuUGT73B12 and LuUGT73B13, loss of conserved introns and gain of intron 2 was observed. Thus, analysis of the evolution of the flax UGT multigene family provides evidence for both intron gain and loss and thereby strongly supports the “intron-late” theory of intron evolution.
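Intron phase, as used above, is the position of an intron relative to the reading frame: phase 0 introns lie between two codons, phase 1 introns after the first nucleotide of a codon, and phase 2 introns after the second. A minimal Python sketch of this standard definition follows; the function name `intron_phases` and the exon lengths are hypothetical and are not taken from the flax annotation.

```python
from typing import List


def intron_phases(cds_exon_lengths: List[int]) -> List[int]:
    """Return the phase of each intron in a gene.

    cds_exon_lengths holds the coding length (in nucleotides) of each exon
    in 5'-to-3' order. The phase of an intron is the number of coding
    nucleotides that precede it, modulo 3.
    """
    phases = []
    coding_so_far = 0
    # There is one intron after every exon except the last.
    for length in cds_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


# Hypothetical gene with three coding exons (903 nt in total).
original = [300, 151, 452]
# Same gene after the first exon-intron boundary slides by one nucleotide.
slid = [301, 150, 452]

if __name__ == "__main__":
    print(intron_phases(original))
    print(intron_phases(slid))
```

Moving the first boundary by a single nucleotide changes the phase of the first intron while leaving the second untouched, which is why a shared phase at a shared position is usually read as evidence of common origin.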
Expressed flax UGTs: identified by digital expression analysis and supported by RT-qPCR
Functional divergence among duplicated genes is one of the most important sources of evolutionary innovation in complex organisms. Interestingly, among the 22 duplicated genes, five pairs of genes (LuUGT94G3 and LuUGT94G4; LuUGT73B12 and LuUGT73B13; LuUGT712B1 and LuUGT712B5; LuUGT86A8 and LuUGT86A9; and LuUGT74S5 and LuUGT74S6) showed evidence of differential expression. For example, LuUGT74S5 showed seed coat specific expression, while its duplicated counterpart, LuUGT74S6, remained unexpressed. Evidence for differential expression was also provided by the duplicated gene pair LuUGT86A8 and LuUGT86A9. This suggests that after duplication, the genes acquired either differential or tissue specific expression patterns. In an earlier study, Haberer et al. estimated that about two thirds of duplicate gene pairs had divergent expression in Arabidopsis.
Gene expression pattern analysis helps to predict the roles of these UGT genes in various tissue types and to infer which gene family members are likely to perform distinct or similar roles. With this aim, we performed expression analysis of flax UGTs using EST libraries, microarray data and RT-qPCR. About 62% of the flax UGTs showed expression evidence based on the EST data, and one or more ESTs were detected per tissue type, providing strong evidence that most of the flax UGT genes were expressed in varied tissue types. The expression patterns analysed using RT-qPCR correlated well with the digital expression analysis.
The frequency of ESTs per UGT gene ranged from 1 to 54, suggesting varied expression levels. Among the different tissue types, seed and stem tissues showed the highest number of expressed UGTs. Flax seeds and stems are known to contain a large number of secondary metabolites, which could explain the abundance of UGTs in these tissues [48, 49]. However, this could also be due to the large number of EST libraries available for these tissue types (seed: 9 EST libraries, 220,724 ESTs; stem: 3 EST libraries, 32,184 ESTs). This study also identified two genes, LuUGT85Q2 and LuUGT74S1, belonging to groups G and L respectively, which showed high expression in flower and in seed coat from the torpedo stage. The members of these groups are predicted to glycosylate the terpenoid, flavonoid and benzoate classes; hence, these genes can be considered potential targets for screening against these predicted classes to identify their substrates.
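The library-size caveat above can be made concrete by scaling raw EST counts to a common library size. The Python sketch below uses the seed and stem library sizes quoted in the text (220,724 and 32,184 ESTs); the per-gene counts, the scaling unit of 100,000 ESTs and the function name `ests_per_100k` are assumptions made for illustration and are not part of the reported analysis.

```python
from typing import Dict

# Total ESTs per tissue, as quoted in the text.
LIBRARY_SIZES: Dict[str, int] = {"seed": 220_724, "stem": 32_184}


def ests_per_100k(raw_counts: Dict[str, int],
                  library_sizes: Dict[str, int]) -> Dict[str, float]:
    """Scale raw EST counts to counts per 100,000 ESTs in each tissue."""
    return {
        tissue: count / library_sizes[tissue] * 100_000
        for tissue, count in raw_counts.items()
    }


# Hypothetical raw EST counts for a single UGT gene.
raw = {"seed": 54, "stem": 10}
normalized = ests_per_100k(raw, LIBRARY_SIZES)

if __name__ == "__main__":
    for tissue, value in normalized.items():
        print(f"{tissue}: {raw[tissue]} raw ESTs, {value:.2f} per 100,000")
```

With these invented counts the gene has more raw ESTs in seed, yet the scaled value is higher in stem, so comparisons of EST abundance between tissues are safer on scaled counts than on raw ones.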
Compared with the sequence-based expression analysis method, microarrays provide a high-throughput tool for simultaneous analysis of expression at the whole-transcriptome level. According to the microarray data, 44% of the flax UGTs showed expression evidence in various tissue types (Figure 3). Three genes from the seed stage and one gene from leaf showed high expression, suggesting possible involvement of these genes in seed and leaf secondary metabolite glycosylation. Microarray data from two contrasting flax varieties, Drakkar and Belinka, were also analyzed. Drakkar produces better quality fibres than Belinka and is more resistant to the fungal pathogen Fusarium. However, we could not detect any UGT with a variety-specific expression pattern. Although plant UGTs have been reported to be involved in defence mechanisms, the available microarray data were not generated by exposing the varieties to any pathogen. The difference in expression of the UGTs between the EST and microarray datasets might have resulted from differences in the number of tissue types, the size of each dataset and the varieties used for data generation. The EST dataset was larger than the microarray dataset; therefore, we might have obtained expression evidence for more genes using the EST dataset. Moreover, the long sequence reads of ESTs provide fairly unambiguous evidence of gene expression compared with hybridization-based microarray data; hence, EST profiling could be considered a more reliable method for transcriptomic analysis, as also suggested by Geisler-Lee et al. and Moreau et al.
Regarding the 37 unexpressed flax UGTs, it is possible that some or most of these genes are expressed at very low levels in particular tissue types or only under specific conditions such as biotic or abiotic stress. Hence, they might not have been represented in the EST and microarray data, as the data were generated from unchallenged libraries. Even in the large Arabidopsis EST collection gathered over several years, only 64.5% of the genes had corresponding ESTs. The absence of an EST for a gene implies that the gene is inactive, is expressed at an undetectable level in the tissues sampled, or is non-functional per se.