Models for the evolution of p1 alleles
A distinguishing feature of the p locus is its tremendous allelic diversity, which makes it a suitable locus for studying evolutionary changes and chromosomal dynamics on both large and small scales. Although the grass family arose by an ancient whole genome duplication (WGD) event [24], the p gene has only a single ortholog in rice and sorghum, indicating that one copy was lost from the paleoploid ancestral genome. However, the more recent allotetraploidization event, which formed the ancestor of maize about 5 mya [9], gave rise to two p copies located in the homoeologous regions of chromosomes 9 and 1. The copy on chromosome 1 was then duplicated in tandem 2.75 mya, thereby evolving into the current p2 (ortholog) and p1 (paralog) genes [8, 10]. The bulk of retrotranspositions in most grasses occurred more recently. A series of nested insertions that split approximately 80 bp of the p 3' UTR occurred between 1.4 and 0.2 mya [10]. Although retroelements are highly repetitive in the genome, insertions of retroelements in a nested fashion create unique sequence junctions and become chromosomal markers [25]. However, we do not know whether the retroelements transposed into the paralog or ortholog repeat, or maybe even into a later-generated copy. A model proposed for the evolution of single-copy alleles states that the retroelement insertion occurred in the 3' UTR of p2, thereby separating p1, which turned into P1-rr and P1-rw [8, 17] (Figure 3). In contrast, transposition into the 3' UTR of p1 retains the repeat structure and allows the amplification of additional copies by unequal crossover, as suggested for the evolution of the multi-copy P1-wr[B73] allele [10] (Figure 3). Theoretically, only a few recombination events are needed to transfer p1 and p2 sequences across the retroelement cluster. Therefore, multi-copy alleles in a tandem array could have been created upstream, downstream or on both sides of the cluster simultaneously.
These intermediate structures that enable us to discover the step-by-step evolution of all p alleles might still exist in the maize germplasm. Our current analysis allows us to present new and refined models for the evolution of p1-ww[4Co63], P1-rr4B2 and P1-rw1077.
Models for the evolution of p1-ww[4Co63]
p1-ww is a null allele because p1-specific sequences such as the coding, promoter and proximal enhancer sequences are absent in 4Co63 (Figure 3). Is it possible that p1-ww[4Co63] represents a haplotype where the tandem duplication of the ancestral p gene never took place? According to this hypothesis, the nested retroelement insertions in 4Co63, which are identical to those in alleles containing p1 and p2 sequences, must have happened before the p duplication event. However, since the p duplication occurred 2.75 mya, 1.37 million years before the first retroelement insertion, we can disregard this possibility [8, 10]. Thus, the alternative explanation that a functional p1 allele was deleted to give rise to p1-ww[4Co63] is more likely. The p1-ww[4Co63] structure does not reveal the functional p1 allele(s) or the deletion or recombination events that resulted in the current null allele. Considering that p1 alleles are located on both sides of the retroelement cluster, multiple recombination events could have occurred to create the p1-ww[4Co63] allele.
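The dating argument above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in the text (duplication at 2.75 mya, a 1.37 million year interval, and the reported insertion window of 1.4 to 0.2 mya); the function and variable names are ours and purely illustrative.

```python
# Dates in million years ago (mya), as quoted in the text.
DUPLICATION_MYA = 2.75              # tandem duplication of the ancestral p gene
GAP_MY = 1.37                       # interval from duplication to first insertion
INSERTION_WINDOW_MYA = (1.4, 0.2)   # reported window for the nested insertions


def first_insertion_mya(duplication_mya, gap_my):
    """Age of the first retroelement insertion implied by the two figures."""
    return round(duplication_mya - gap_my, 2)


def insertion_predates_duplication(duplication_mya, insertion_mya):
    """True only if the insertion is older (larger mya value) than the duplication."""
    return insertion_mya > duplication_mya


first = first_insertion_mya(DUPLICATION_MYA, GAP_MY)
print(first)
print(insertion_predates_duplication(DUPLICATION_MYA, first))
```

The implied age of the first insertion falls inside the reported window and is younger than the duplication, which is why the no-duplication hypothesis is set aside.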
One possible scenario for the origin of p1-ww in inbred line 4Co63 is that the null allele, which carries a functional p2, is derived from P1-wr[B73]. While unequal crossover among repeat sequences can lead to an increase of copy numbers, the alternative outcome is a reduction of repeats. During the evolution of the P1-wr[B73] allele, unequal crossover between the flanking genes of the cluster, namely p2/p1[B73] and p1/p2[B73], could have caused the deletion of all P1-wr[B73] repeats, and would still have generated a functional p2 gene (Figure 7A). However, p2 in 4Co63 differs by various SNPs and indels from the corresponding p2/p1[B73] and p1/p2[B73] sequences of P1-wr[B73], indicating that P1-wr[B73] might not be the immediate progenitor for p1-ww[4Co63].
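The two reciprocal outcomes of unequal crossover, loss and gain of repeats, can be illustrated with a toy model. This is a sketch under simplifying assumptions: each gene copy is treated as a single unit, the number of internal repeats (three) is hypothetical, and the labels and the function name are ours rather than part of any published model.

```python
def unequal_crossover(chromatid_a, chromatid_b, i, j):
    """Exchange within copy i of chromatid_a and copy j of chromatid_b.

    Each chromatid is a list of copy labels in 5'-to-3' order. The copy at
    the exchange point becomes a hybrid, written "left~right".
    Returns the two reciprocal products.
    """
    hybrid_ab = f"{chromatid_a[i]}~{chromatid_b[j]}"
    hybrid_ba = f"{chromatid_b[j]}~{chromatid_a[i]}"
    product_1 = chromatid_a[:i] + [hybrid_ab] + chromatid_b[j + 1:]
    product_2 = chromatid_b[:j] + [hybrid_ba] + chromatid_a[i + 1:]
    return product_1, product_2


# Hypothetical array: two flanking genes around three internal repeats.
array = ["p2/p1", "wr", "wr", "wr", "p1/p2"]

# Misaligned pairing: the 5' flanking gene of one chromatid pairs with the
# 3' flanking gene of the other.
deleted, amplified = unequal_crossover(array, array, 0, len(array) - 1)
print(deleted)
print(amplified)
```

In this toy case one product retains a single hybrid copy, corresponding to the loss of all internal repeats, while the reciprocal product carries an expanded array; the total number of copies across both products is unchanged.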
p1-ww[4Co63] also could have evolved by a recombination event that involved two different p1 alleles. Unequal crossing over between p2 of P1-rr4B2 and p1/p2[B73] of P1-wr[B73] could have generated the current p1-ww[4Co63] structure and could have restored the p2 copy (Figure 7B). Even then, the deletion of the original paralog would have been derived from the P1-wr[B73] allele. Nevertheless, this could not have happened recently (on an evolutionary time scale) because of sequence polymorphisms in the participating alleles. Interestingly, both p1-ww[4Co63] and P1-wr[B73] carry as a signature the Shadowspawn retroelement in the same position, indicating that p1-ww[4Co63] most likely derived from P1-wr[B73] in multiple steps.
In addition, we can envision a P1-rw-like allele, which is similar to P1-wr[B73] in the distal enhancer structure. Such a P1-rw allele has been described [17]. An unequal crossover between the large repeats flanking the coding regions either duplicates the p1 gene or deletes the coding sequences, the latter resulting in the p1-ww[4Co63] structure (Figure 7C). This scenario resembles the origin of p1-ww1112 [14]. However, this model does not directly account for the Shadowspawn retroelement in p1-ww[4Co63]. All models demonstrate the complexity of the p locus and reveal the countless possibilities for recombination to occur whenever paralogous sequences are present.
Model for the evolution of P1-rw1077 and P1-rr4B2, with focus on regulatory sequences
Despite the repeat structure of the P1-wr[B73] cluster, a single P1-wr[B73] copy has the least complicated p1 allele composition, followed by P1-rw1077 and then P1-rr4B2. We hypothesize that P1-rw1077 originated from a P1-wr-like tandem array (Figure 5) because P1-rw1077 comprises a sequence fragment downstream of the p2 section that is virtually identical to the junction sequence of two P1-wr[B73] repeats in a head-to-tail assembly. This P1-wr tandem array could have been located on either side of the retroelement cluster.
A plausible sequence of events is as follows. A Mu-like element inserted into one of the P1-wr repeats 1,204 bp after exon 3 of the previous copy. Then an aberrant transposition event (abortive excision event) of this MULE caused a DNA double-strand break that enabled exonucleases to digest the unprotected DNA ends (Figure 5) thereby extending the gap into the adjacent P1-wr repeat. The deletion would have included MULE sequences (about 3.5 kb compared to a putative autonomous element) and almost the entire length of a P1-wr repeat (more than 12 kb). Non-homologous end-joining [21–23, 26], copying a 9-bp sequence (AACCTATGT) that is located 27 bp downstream from the deletion endpoint, must have repaired the break. Due to the nature of tandem repeats, the large deletion described above results in small repeats of 203 bp that are flanking the MULE fragments. Interestingly, this duplication is part of a 1.2-kb sequence that contains the enhancer element of P1-rr.
A single P1-wr-like allele downstream of the retroelement cluster that is flanked by large repeats due to the retroelement insertion in p2 could have been converted into a tandem array by unequal crossover between the large repeats (Figure 7C). Gene conversion events then could have transferred the altered region that originated at the 3' large repeat to the 5' large repeat where the distal enhancer sequence functions [17]. Alternatively, P1-rw1077 arose from P1-wr repeats upstream of the retrotransposon cluster. The sequence 3' of this cluster, which corresponds to the 3' intergenic region of p2 as found in the P1-wr[B73] cluster and p1-ww[4Co63], is nearly identical to the 5' end of a P1-wr[B73] repeat over a stretch of 5.2 kb (Figure 2 and Additional file 6: Supplemental Figure S4A). Due to this sequence similarity, a recombination event between p1-ww[4Co63] and the proposed P1-rw1077 precursor could have occurred that positioned P1-rw1077 downstream of the retrotransposon cluster (Additional file 6: Supplemental Figure S4A). This arrangement assumes that the P1-rw1077 allele resembles p1-ww[4Co63] at the 5' end. After the recombination break point, P1-rw1077 has to be closer to P1-wr[B73] because, based on our model, P1-rw1077 is derived from P1-wr. Indeed, a sufficient number of polymorphisms between the p1-ww[4Co63] and P1-wr[B73] alleles enables us to verify the predicted structure and to place the possible recombination site between 567 and 713 bp after the point where the p1-ww[4Co63] and P1-wr[B73] alignment begins. Further recombination/gene conversion events contributed to the evolution of the present P1-rw1077 allele.
The presence of the MULE fragments and filler DNA in P1-rr in exactly the same sequence context as in P1-rw1077 agrees with our model that P1-rr continued to evolve from P1-rw1077. In our model for the origin of P1-rr, we propose a second DNA double-strand break (DSB) that occurred in P1-rw1077 between the stop codon and the MULE insertion (Figure 6). In contrast to the first DSB, there is no evidence for the participation of a TE, leaving the cause of the DSB unknown. Exonuclease activities expanded the gap until both ends were joined by NHEJ, which synthesized two short DNA pieces (filler DNAs) from sites close to the deletion end points into the repair site. The DSB repair caused a deletion of 1,410 bp across the repeat junction that spanned almost the entire sequence from the stop codon to the MULE fragments. Interestingly, this intermediate P1-rr structure can be found at the 3' end of P1-rr1088, P1-rrCFS36 and P1-rwCFS342 [17].
The 5' transposon fragment happened to contain an 8-bp sequence close to the TIR (55-62 bp) that is present 1,269 bp further downstream as well. Unequal crossover between those 8 bp resulted in a tandem direct duplication of this 1,269 bp sequence. Accordingly, the final 318 bp of exon 3, being part of the repeat, were duplicated as well. A sequence at the 3' end of the first repeat was adopted as a splice acceptor site, thereby generating a fourth exon. Although alternative splicing of exons 1, 2 and 4 has been reported, the protein product is of unknown function or may not have any function at all [27].
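How an exchange between two copies of a short direct repeat yields a tandem duplication of the intervening interval can be shown with a synthetic sequence. In this sketch the 8-bp motif, the homopolymer filler and the flank lengths are placeholders rather than real p1 sequence, and we assume that the 1,269 bp spacing is measured between the starts of the two motif copies.

```python
def crossover_between_repeats(seq, first, second, motif_len):
    """Unequal exchange between two copies of a short direct repeat.

    seq carries the same motif at 0-based offsets `first` and `second`.
    Returns (duplication product, deletion product).
    """
    assert seq[first:first + motif_len] == seq[second:second + motif_len]
    # Left arm up to and including the downstream motif, joined to the
    # right arm that follows the upstream motif.
    duplication = seq[:second + motif_len] + seq[first + motif_len:]
    # Reciprocal product: the interval between the motif copies is lost.
    deletion = seq[:first + motif_len] + seq[second + motif_len:]
    return duplication, deletion


MOTIF = "ACGTTGCA"   # placeholder 8-bp motif, not the actual sequence
SPACING = 1269       # distance between the two motif copies (from the text)

left = "T" * 54                          # motif then occupies positions 55-62
between = "G" * (SPACING - len(MOTIF))   # placeholder filler
right = "C" * 100                        # placeholder 3' flank
seq = left + MOTIF + between + MOTIF + right

dup, dele = crossover_between_repeats(seq, 54, 54 + SPACING, len(MOTIF))
print(len(dup) - len(seq))    # size of the gained interval
print(len(seq) - len(dele))   # size of the lost interval
```

The duplication product carries the interval twice in direct orientation, bounded by three copies of the motif, whereas the reciprocal product retains a single motif copy.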
This putative evolutionary pathway explains how the P1-wr 3' UTR was almost entirely replaced by a MULE, how the fourth exon unique to P1-rr was generated and how the 1,269 bp SalI fragment containing the P1-rr distal enhancer was nearly completely duplicated (the initial 175 bp of the enhancer region are missing from the first repeat). Subsequently, gene conversion events could have placed part of the modified enhancer sequence of the downstream copy into the upstream large repeat [17]. Alternatively, if this P1-rr module arose on the P1-wr[B73] side of the retroelement cluster, as we also discussed for P1-rw1077, then a recombination event between p1-ww[4Co63] and the P1-rr ancestor could have transferred the P1-rr end to a position downstream of the retrotransposon cluster (Additional file 6: Supplemental Figure S4B). The crossing over took place in the 595 bp sequence between the duplicated MULEs, which is why the repeat structure of P1-rr at the 5' end differs from the 3' end whereas they are identical in P1-rw1077. Lastly, a 1.6 kb hAT-like transposable element inserted 340 bp upstream of the MULE or 159 bp 5' of the enhancer region. This transposition did not occur in P1-rr1088 [17]. Taken together, the novel distal enhancer structure of P1-rr could be the result of a MULE insertion and excision, deletion and repair by NHEJ, and duplication and deletion by recombination. This series of events from P1-wr to P1-rr confirms the sequential model of P1-rw and P1-rr evolution based on phylogenetic analysis [17].
Function of the enhancer region rearrangements on p1 expression
When the p1 paralog was formed, it probably included the complete p coding sequence and the basal promoter that controls p expression in silk tissue. Then the paralog acquired two additional regulatory sequences adding equally to the basal expression in pericarp and glume. The enhancer sequences were identified and tested in transient assays and transgenic plants using P1-rr fragments fused to a GUS reporter gene [19, 20]. A 1-kb sequence adjacent to the promoter contains a regulatory sequence termed proximal enhancer while a 1.2-kb fragment further upstream includes a distal enhancer (Figure 2). The proximal enhancer region corresponds mostly to a truncated MULE that captured part of a host gene in between the TIR [10]. The proximal enhancer region and the basal promoter sequence are virtually identical in all p1 alleles sequenced to date (Figure 2). In contrast, the distal enhancer region varies in all p1 alleles as described above. Therefore, we hypothesize that the different spatial and temporal expression patterns of p1 alleles are caused by distinct distal enhancer regions [17]. The distal enhancer as defined in P1-rr is located within a 1,269-bp SalI fragment [19, 20], out of which 671 bp are derived from the Mu-like transposon (Figure 2). Although this MULE fragment is missing in P1-wr[B73], transgenes constructed from P1-wr upstream regulatory sequences linked to P1-rr cDNA produced red pericarp and cob glumes in transgenic plants [28], indicating that the enhancer sequence is included in the 589-bp region downstream of the MULE. Since this 589-bp region is duplicated in P1-rr, P1-rr has two enhancer sites that are separated by the MULE fragment. Additional P1-rr alleles, namely P1-rr1088 and P1-rrCFS36, were shown to have the same enhancer structure as P1-rr4B2 with the exception of the missing hAT insertion in P1-rr1088 [17].
Therefore, the hAT transposable elements inserted in the upstream copy of the enhancer region of P1-rr4B2 and P1-rrCFS36 obviously do not disrupt the enhancer sequence and function. Compared to P1-rr, P1-rw1077 has a deletion of 381 bp in the upstream repeat, which causes the loss of cob glume pigmentation [13]. Interestingly, two additional P1-rw alleles, P1-rwCFS302 and P1-rwCFS342, lack the entire upstream repeat and the MULE fragment, thus having the same enhancer arrangement as a single P1-wr[B73] copy [17]. Taken together, the analysis of three P1-rr and three P1-rw alleles revealed that P1-rr alleles contain two copies of the specific enhancer sequence while P1-rw alleles only have one [17]. Interestingly, this region coincides with a tissue-specific DNase I-hypersensitive site that remains closed in pericarp tissue of P1-pr, a silenced epiallele of P1-rr4B [29]; the P1-pr phenotype is shown in Figure 1A. It was reasoned that the upstream enhancer repeat that is missing in P1-rw1077 controls the glume-specific expression in a position-dependent manner, since the identical enhancer region is located 671 bp further downstream [13]. An alternative explanation was prompted by the fact that p1 expression in pericarp is weaker and delayed in P1-rw1077 compared to P1-rr. We hypothesize that the transcriptional strength of p1 alleles is correlated with the enhancer copy number, which is supported by similar findings in human upstream enhancers [30]. Consequently, P1-rw1077 produces less P1 protein than P1-rr in all expressing tissues. Also, each p1 allele is not expressed uniformly in female and male floral tissues within a plant. For example, in P1-rr, p1 transcription is usually higher in pericarp than in cob glumes [15].
Therefore, we propose that the presence of only one distal enhancer site in the P1-rw1077 allele results in weak expression in pericarp tissue but no expression in cob glumes. Due to the duplication of the enhancer sequence as outlined in our model, p1 transcription in pericarp and glume tissue was equally elevated such that p1 is strongly expressed in pericarp and weakly expressed in glumes, thereby giving rise to P1-rr alleles. Note that comparisons with P1-wr alleles are not appropriate due to their post-transcriptional silencing, which potentially is repeat induced [31]. This model is supported by an analysis of the spatial expression pattern in transgenic plants where various p1 constructs were expressed in only a few p-expressing tissues, resembling P1-rr or P1-rw phenotypes. It was shown in these transgenic plants that p1 expression follows a spatial hierarchy that begins with pericarp and continues with cob glumes, husk, silk, and tassel glumes in decreasing order [32, 33]. For instance, if the transgenes had been expressed in only one tissue, then it would have had to be in pericarp, in the case of two tissues then in pericarp and glumes, and so on.
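The spatial hierarchy described above amounts to a simple rule: the set of expressing tissues is always a leading segment of the ordered tissue list. A minimal sketch of that rule follows; the tissue order is taken from the text, whereas the function names are illustrative and the rule itself is a simplification of the reported observations.

```python
# Tissue order of p1 expression in transgenic plants, strongest first.
HIERARCHY = ["pericarp", "cob glumes", "husk", "silk", "tassel glumes"]


def expressing_tissues(n_tissues):
    """Tissues predicted to express p1 if exactly n_tissues show expression."""
    if not 0 <= n_tissues <= len(HIERARCHY):
        raise ValueError("n_tissues must be between 0 and 5")
    return HIERARCHY[:n_tissues]


def is_consistent(observed):
    """True if the observed tissues form a leading segment of the hierarchy."""
    observed = set(observed)
    return observed == set(HIERARCHY[:len(observed)])


print(expressing_tissues(1))   # pericarp only, resembling a P1-rw phenotype
print(expressing_tissues(2))   # pericarp and cob glumes, resembling P1-rr
print(is_consistent(["silk"]))
```

Under this rule, expression in silk alone would contradict the hierarchy, because every tissue higher in the order would have to express p1 as well.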
The p alleles differ in their 3' UTR
Polyadenylation is involved in many facets of mRNA metabolism including enhancement of mRNA stability, transport of mRNA from the nucleus into the cytoplasm, and regulation of mRNA translation. Although polyadenylation signals in plants are less conserved than in mammals [34], three signals were identified in maize, rice, and Arabidopsis: the far upstream element (FUE, located at -150 to -35 nt relative to the cleavage site), the near upstream element (NUE, situated at -35 to -10 nt relative to the cleavage site) and the cleavage element (CE, positioned from 10 nt upstream to 15 nt downstream of the cleavage site) [35, 36]. As we have shown above, a fragmented MULE was placed adjacent to the P1-rr4B2 stop codon, possibly due to an NHEJ event. All mapped polyadenylation sites of the P1-rr4B2 transcript are located within the MULE sequence, indicating that P1-rr4B2 successfully recruited alternative polyadenylation signals in the transposon. Similarly, a Mu insertion in the 3' UTR of the rf2a locus also resulted in the adoption of new polyadenylation signals and sites [37]. Retroelements, the most common transposons in maize, also insert in 3' UTRs without disrupting polyadenylation as demonstrated above for the p2 alleles. Our results suggest that polyadenylation in maize is a highly dynamic process which, despite its importance for the cell, is not tightly regulated. The large number of polyadenylation sites found in our analysis of P1-wr[B73] transcripts that do not contain a transposon insertion supports this conclusion. A genome-wide analysis of genomic and transcript data could shed light on the mechanism of polyadenylation in maize and could establish the proportion of genes that terminate in transposable elements. Interestingly, it has been shown that many polyadenylation signals in human and mouse genes have been derived from transposable elements [38].
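The positions of the three plant polyadenylation elements can be expressed as windows relative to the cleavage site. The sketch below encodes the windows quoted above, with negative offsets denoting nucleotides upstream of the cleavage site; treating both window boundaries as inclusive is our assumption, and the names are illustrative.

```python
# Windows in nt relative to the cleavage site (negative = upstream).
ELEMENT_WINDOWS = {
    "FUE": (-150, -35),   # far upstream element
    "NUE": (-35, -10),    # near upstream element
    "CE": (-10, 15),      # cleavage element
}


def elements_at(offset):
    """Names of the elements whose window contains the given offset.

    Boundaries are treated as inclusive (an assumption), so an offset of
    -35 or -10 falls into two adjacent windows.
    """
    return [
        name
        for name, (start, end) in ELEMENT_WINDOWS.items()
        if start <= offset <= end
    ]


print(elements_at(-20))    # inside the NUE window
print(elements_at(5))      # inside the CE window
print(elements_at(-200))   # outside all three windows
```

Such a lookup could be applied to each mapped polyadenylation site to ask which part of a transposon sequence would have to supply the corresponding signal.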
Besides polyadenylation signals, transcriptional as well as translational regulators have been identified in the 3' UTR of plant and animal genes, and their gain or loss could cause allelic diversity. For example, targets of microRNAs are often located in 3' UTRs [38, 39].
Gene copying events promote allelic diversity
Recombination is crucial for the evolution of genomes [40, 41]. In particular, the non-homologous recombination pathway is frequently used to repair DNA double-strand breaks in somatic plant cells [26]. Previously, we reported a probable NHEJ event involved in the formation of the P1-wr[B73] cluster [10] that produced a hybrid gene due to the ligation of deletion end points located within two genes. Similarly, deletions and repair by NHEJ in the above-mentioned alleles could have resulted in the restructuring of an enhancer region and the formation of a novel 3' UTR.
The exceptional allelic variation at the p locus raises the question of how it compares with genes that exhibit less variation. We propose that the main cause for the diversity might lie in tandem gene amplification [8, 17, 42, 43]. Once a gene has undergone an initial tandem duplication, multiple unequal recombination events can follow, as seen in the P1-wr[B73] multi-gene cluster [10]. A single crossing over or gene conversion event between misaligned paralogous gene copies can generate many new alleles including deletion and amplification derivatives. Interestingly, in plants such events can occur mitotically and can be transmitted into the next generation, thereby increasing allelic variation [44]. This explanation then implies that other loci exhibiting an increased allelic variation are multi-copy genes as well. Indeed, the complex r1 locus in maize is analogous to p1 in many aspects. The r1 locus, which also encodes a transcription factor, confers bluish anthocyanin pigmentation to various vegetative and floral plant tissues. Two r1 alleles, R-st and R-r, are molecularly well characterized. R-st contains various r1 genes, four of which are in tandem orientation [45]. R-r consists of one complete and three truncated r1 genes that originated from tandem duplication [46, 47]. Comparable to p1 in complexity, both alleles undergo recombination and transposition events creating numerous derivative alleles. Paralogous gene copies in maize were also found at the pl1 [48] and a1 loci [49]. The prolamine gene family, with nearly 50 copies distributed over several chromosomes, particularly exemplifies the outcome of gene duplications [50]. In fact, a large proportion of genes are tandemly duplicated in Arabidopsis, rice, and maize [51–53]. Considering the number of paralogous sequences and their possibilities to recombine, a single reference genome providing just one allele obviously cannot reflect this allelic potential of the maize genome.
Not surprisingly, a recent genomic comparison between the B73 and Mo17 inbred lines [54] revealed a large number of copy number variations and presence/absence variations, confirming previous results [55]. Nonetheless, epialleles remain invisible in a traditional sequence comparison. Allelic diversity studies as presented here are essential for our understanding of the remarkably dynamic maize genome.