Divergence of gene regulation through chromosomal rearrangements
© Goettel and Messing; licensee BioMed Central Ltd. 2010
Received: 4 August 2010
Accepted: 30 November 2010
Published: 30 November 2010
The molecular mechanisms that modify genome structures to give birth and death to alleles are still not well understood. To investigate the causative chromosomal rearrangements, we took advantage of the allelic diversity of the duplicated p1 and p2 genes in maize. Both genes encode a transcription factor involved in maysin synthesis, which confers resistance to corn earworm. However, p1 also controls accumulation of reddish pigments in floral tissues and has therefore acquired a new function after gene duplication. p1 alleles vary in their tissue-specific expression, which is indicated in their allele designation: the first suffix refers to red or white pericarp pigmentation and the second to red or white glume pigmentation.
Comparing chromosomal regions comprising p1-ww[4Co63], P1-rw1077 and P1-rr4B2 alleles with that of the reference genome, P1-wr[B73], enabled us to reconstruct additive events of transposition, chromosome breaks and repairs, and recombination that resulted in phenotypic variation and chimeric regulatory signals. The p1-ww[4Co63] null allele is probably derived from P1-wr[B73] by unequal crossover between large flanking sequences. A transposon insertion in a P1-wr -like allele and NHEJ (non-homologous end-joining) could have resulted in the formation of the P1-rw1077 allele. A second NHEJ event, followed by unequal crossover, probably led to the duplication of an enhancer region, creating the P1-rr4B2 allele. Moreover, a rather dynamic picture emerged in the use of polyadenylation signals by different p1 alleles. Interestingly, p1 alleles can be placed on both sides of a large retrotransposon cluster through recombination, while functional p2 alleles have only been found proximal to the cluster.
Allelic diversity of the p locus exemplifies how gene duplications promote phenotypic variability through composite regulatory signals. Transposition events increase the level of genomic complexity based not only on insertions but also on excisions that cause DNA double-strand breaks and trigger illegitimate recombination.
An exciting challenge of biological research has been to understand phenotypic diversity within a species, which affects virtually every organ and cell type. In plants, this intraspecific diversity is often readily visible in the size, shape, color and number of flowers, fruits and seeds. Diversity can occur in every region of the gene, in coding regions or in regulatory sequences including upstream promoter and enhancer sequences, 5' and 3' UTRs and regulatory introns [1, 2]. Changes in regulatory regions affecting allele expression and transcript amount can be simple, such as small and large indels, or more complex, such as transposon insertions and structural rearrangements. Molecular mechanisms responsible for the sequence modifications are replication errors, recombination and transposition. Although the majority of allelic variation is due to nucleotide polymorphisms, phenotypic differences can be caused by epigenetic modifications such as DNA methylation [3, 4].
Phlobaphene pigmentation is most readily visible in the pericarp, i.e. the outer layer of the kernel, and the cob glumes. Traditionally, p1 alleles are phenotypically categorized and named based on expression in these tissues. The p1 gene designation is followed by a two-letter suffix that refers to pericarp and cob color, respectively. For instance, the P1-rr allele exhibits red pericarp and red cob glume pigmentation while the P1-rw allele has red pericarp and white or colorless cob glumes (Figure 1A). Each phenotypic p1 group may consist of structurally very different alleles. Only a few p1 alleles have been structurally characterized, and only a small number of these have been completely or partially sequenced. P1-rr4B2 and P1-rw1077 are single-copy genes that were both introgressed into the inbred line 4Co63. This inbred line contains a loss-of-function p1-ww allele. P1-wr in inbred line B73 is a multi-copy allele, consisting of 11 P1-wr tandem repeats that are flanked by p2/p1 and p1/p2 hybrid genes upstream and downstream of the cluster, respectively. A large retroelement cluster is inserted in the 3' UTR of the p1/p2 hybrid gene.
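The naming convention described above is mechanical enough to express as a small lookup. The following Python sketch is for illustration only: the function name `decode_p1_allele` and the dictionary keys are hypothetical, and it reads only the r/w suffix letters defined here, ignoring the accession that follows them.

```python
import re

# Suffix letters as defined in the text: r = red, w = white (or colorless).
COLOR = {"r": "red", "w": "white"}


def decode_p1_allele(name):
    """Return the pericarp and cob glume color implied by a p1 allele name.

    Hypothetical helper for illustration; it reads only the two-letter
    suffix after "P1-" or "p1-" and ignores the rest of the designation.
    """
    match = re.match(r"[Pp]1-([rw])([rw])", name)
    if match is None:
        raise ValueError(f"not an r/w p1 allele designation: {name!r}")
    pericarp, glume = match.groups()
    return {"pericarp": COLOR[pericarp], "cob_glume": COLOR[glume]}


for allele in ("P1-rr4B2", "P1-rw1077", "P1-wr[B73]", "p1-ww[4Co63]"):
    print(allele, decode_p1_allele(allele))
```

The first letter always maps to the pericarp and the second to the cob glumes, so P1-rw and P1-wr describe opposite phenotypes.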
The p1-ww alleles do not encode a functional P1 transcription factor; therefore pericarp as well as cob glumes are colorless (Figure 1A). While loss-of-function alleles often result in deleterious or even lethal conditions for the organism, non-functional p1 alleles do not cause any impairment that would reduce the fitness of the mutant plant. The p1-ww alleles can vary in origin and structure. Most of the structurally known p1-ww alleles are derived from P1-rr by transposon insertions and/or excisions. The p1-ww1112 null allele, for example, arose from a transposon-induced recombination event between the 5.2-kb direct repeats, which led to the deletion of the entire coding sequence . However, the origin of p1-ww allele in the inbred line 4Co63 is not known, but p1-ww[4Co63] is often used in genetic crosses. Brink, for instance, introgressed more than 100 p1 alleles in the inbred line 4Co63 . Knowledge of the p1-ww[4Co63] sequence could help clarify whether p1-ww[4Co63] is derived from P1-rr , P1-wr [10, 15], P1-rw  or even a different p1 allele and provide further insights into other intermediates of chromosomal rearrangements.
To shed light on the origin of p1 allelic variability, we analyzed here three p1 alleles in their chromosomal context, namely p1-ww[4Co63], P1-rr4B2 and P1-rw1077. First we resolve the structural organization of these p1 alleles and their corresponding p2 alleles on the single-nucleotide sequence level. Subsequently we compare their sequences also to the recently sequenced P1-wr[B73] cluster  to find large and small scale nucleotide polymorphisms that enable us to infer mechanisms for genome rearrangements. In particular, we focus on evolutionary changes in p1 alleles that occurred in the putative distal enhancer region and in the 3' UTRs.
The extensive similarity with the p1/p2[B73] 3' end and intergenic region together with the identical Shadowspawn insertion suggests that both sequences continue to be similar past the end of the lambda clone. To confirm this assumption, we extended the sequence by genomic PCR from the Shadowspawn element to neighboring genes that are unrelated to p and therefore do not participate in potential p recombination events. PCR primer pairs were designed based on the equivalent P1-wr[B73] cluster, and PCR products were cloned and sequenced [GenBank:HM454275]. The analysis of 6,587 bp revealed that 4Co63 and B73 are virtually identical in this sequence; they consist of the 3' end of the Shadowspawn retroelement, a gene encoding a calmodulin-binding protein, part of a gene encoding a protein of unknown function, and intergenic regions (Figure 2). The calmodulin-binding protein, which in 4Co63 measures 361 aa, is 7 aa larger than in B73 and also contains two amino acid substitutions due to 3 indels and 10 SNPs. Based on maize and other EST data, this gene is transcribed and is very conserved in grass species such as rice, sorghum and barley.
Interestingly, the retrotransposons at the 3' end of p2, namely Eninu and Ji, and at the 5' end of the above described "p1-ww" lambda clone, namely Opie and Eninu, are identical to the retroelement cluster of P1-wr[B73] in sequence, insertion site and consequently target site duplications. Although we did not clone and sequence the complete retroelement cluster in p1-ww[4Co63], it is most likely that both clusters in 4Co63 and B73 are identical, at least in their initial transposition of Eninu and their nested insertions of Ji and Opie (Figure 3).
In brief, whereas p2 is present and functional in the 4Co63 inbred line, p1 coding and regulatory sequences are missing with the exception of the distal enhancer region. The structure of the p1-ww[4Co63] allele does not unambiguously point to a single known p1 allele from which p1-ww[4Co63] is derived, although, mechanistically, unequal crossing over between flanking sequences of the p1 gene could have been involved as discussed below.
How does the sequence arrangement of P1-wr[B73] and p1-ww[4Co63] including their flanking genes compare to P1-rr4B2, a p1 single-copy allele that produces red pericarp and red glumes? P1-rr4B2 contains two large repeats flanking the coding sequence, which are about 5.2 kb in size [6, 12, 17]. Interestingly, the sequence upstream of the 5' large repeat contains fragments of Opie and Eninu retroelements inserted in the same position as in p1-ww[4Co63] and p1/p2[B73] as described above (Figures 2 and 3). Likewise, Eninu is bordered by the detached p 3' UTR sequence of 78 bp. Subsequently, P1-rr is highly similar to a single P1-wr[B73] copy with few exceptions: the upstream regulatory region is more complex in P1-rr than in P1-wr[B73] (Figure 2) and both sequences diverge shortly after the stop codon (see below). By sequencing two plasmids, SA206 and PA103, which contain the 3' large repeat and are derived from lambda clones used for the isolation of P1-rr 5' and coding sequences  (see Methods), we extended our P1-rr sequence analysis by 8,923 bp past the 3' large repeat and into flanking genes [GenBank:HM454276]. By aligning both large repeats we found 14 polymorphisms including the insertion of a transposable element of 1,616 bp in the 5' repeat (Figure 2). This element is flanked by 8-bp direct repeats (CCAGTGAG), which is typical for transposons of the hAT superfamily. The 3' large repeat following the upstream regulatory sequence resembles p1-ww[4Co63] but does not contain the Shadowspawn retrotransposon insertion (Figures 2 and 3). Furthermore, the final 4,341 bp of the plasmid insert, not related to P1-rr, are highly similar to the equivalent p1-ww[4Co63] and P1-wr[B73] sequences (Figures 2 and 3). The 3' flanking sequence contains one complete gene and one partial gene in opposite transcriptional orientation compared to P1-rr. 
The first gene, which is separated from P1-rr by 1,175 bp (measured from the end of the 3' P1-rr repeat to the stop codon), encodes the 4Co63-type calmodulin-binding protein consisting of 361 amino acids. No more than 609 bp of intergenic sequence divide the first from the second gene, of which only the final two exons are present in the plasmid clone.
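The 8-bp direct repeats (CCAGTGAG) that flank the 1,616-bp element in the 5' large repeat can be checked by comparing the bases immediately before and after the element. The Python sketch below assumes the element boundaries are already known; the function name `find_tsd`, the flanking bases and the placeholder element of N's are hypothetical and only mimic the reported lengths.

```python
def find_tsd(sequence, element_start, element_end, tsd_length=8):
    """Return the direct repeat flanking an element, or None if absent.

    element_start and element_end are 0-based, end-exclusive coordinates
    of the inserted element within `sequence`.
    """
    if element_start < tsd_length or element_end + tsd_length > len(sequence):
        return None  # not enough flanking sequence to compare
    left = sequence[element_start - tsd_length:element_start]
    right = sequence[element_end:element_end + tsd_length]
    return left if left == right else None


# Toy locus: only the repeat sequence and the element length come from the text.
TSD = "CCAGTGAG"
element = "N" * 1616                      # placeholder for the transposon
locus = "TTGACC" + TSD + element + TSD + "GGATCA"
start = len("TTGACC") + len(TSD)
end = start + len(element)

print(find_tsd(locus, start, end))
```

A result of None would indicate that the flanking bases differ, for example after a deletion or rearrangement at one end of the element.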
The P1-rr sequence analysis revealed that P1-rr is located between the retroelement cluster and the gene encoding a calmodulin-binding protein. Most interestingly, the corresponding site in p1-ww[4Co63] and P1-wr[B73] is empty, i.e. this region does not contain a p1 gene copy (Figure 3). Based on the first maize p2 allele that was isolated from a line which contains the p1-ww1112 allele  we assume that a functional p2 allele of P1-rr4B2 is present upstream of the retroelement cluster because p1-ww1112 and P1-rr4B2 are both derived from the same allele. Furthermore, the p2[p1-ww1112] allele ends in Eninu and Ji retroelement fragments exactly like p2[4Co63] and p1/p2[B73] suggesting structural similarity among these alleles. Therefore, we decided to extend our sequence analysis to the p2 allele that is linked to P1-rr4B2. We used the same genomic PCR strategy as described above to clone and sequence 10,423 bp of p2[P1-rr4B2] [GenBank:HM454272]. Indeed, the alignment of p2[P1-rr4B2] with p2[p1-ww1112] showed no SNPs but only four 1-bp indels that are not part of exons or putative regulatory sequences. Hence both p2 alleles are coding for an identical P2 protein (Additional file 1 : Supplemental Figure S1). As expected, p2[P1-rr4B2] is also flanked by Eninu and Ji retroelement sequences. Introgression of P1-rr4B2 in 4Co63 probably included p2 as well because p2[P1-rr4B2] differs from p2[4Co63].
The P1-rw allele specifies red pericarp and colorless cob glumes (Figure 1A). In general, the structure of P1-rw1077 resembles P1-rr4B2 . P1-rw1077 is a single-copy gene, which consists of a coding region flanked by two 6.3-kb direct repeats (Figure 2). The coding sequence of P1-rw1077 is chimeric in nature. While the 5' UTR is similar to p1, the remaining coding region and adjacent Eninu and Ji retroelements (spanning about 6.9 kb) are p2 -like (Figure 3) . Sequence alignments establish that the p2 fragment is more closely related to p2[P1-rr4B2]/[p1-ww1112] than to p2[4Co63]. Interestingly, the Ji retrotransposon is followed by a truncated P1-wr -like exon, which is not included in the P1-rw1077 transcript. This organization of sequences suggests that P1-rw1077 originated from a gene conversion event between p1 and p2 . The P1-rw1077 sequence upstream of the 5' large repeat is very similar to the corresponding P1-rr4B2 region, suggesting that both alleles occupy the same chromosomal location. We confirmed this by PCR-amplification and sequencing of a 1,651-bp fragment that connects the 3' large repeat of P1-rw1077 with the gene encoding the calmodulin-binding protein. P1-rw1077 was introgressed in 4Co63, and indeed the 3' end of the intergenic region between the 3' large repeat and the neighboring gene is indistinguishable from 4Co63. Since P1-rw1077 and P1-rr4B2 occupy the same chromosomal position we wanted to find out whether the similarity extends to the region upstream of the retrotransposon cluster (Figure 3). We performed genomic PCR as described above to amplify and subsequently sequence 11,313 bp [GenBank:HM454273] that are 99.8% identical to the p2[4Co63] sequence. The 18 SNPs and 3 short indels, which are distributed over a consensus sequence of 10,703 bp, are not included in the p2[P1-rw1077] coding sequence and consequently do not alter the P2 protein sequence (Additional file 1 : Supplemental Figure S1). 
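The reported 99.8% identity can be reproduced from the polymorphism counts given above. This is a rough check rather than the authors' stated method: it assumes each SNP and each short indel counts as one difference over the 10,703-bp consensus, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_length, snps, indels):
    """Percent identity, counting every SNP and indel as one difference.

    Assumption: indel length is ignored, since the text only calls the
    indels "short" and does not give their sizes.
    """
    differences = snps + indels
    return 100.0 * (aligned_length - differences) / aligned_length


# p2[P1-rw1077] versus p2[4Co63]: 18 SNPs and 3 indels over 10,703 bp.
print(round(percent_identity(10703, 18, 3), 1))
```

Counting the indels by length instead would lower the figure slightly, but with only three short indels the rounded value is unlikely to change.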
The polymorphisms between p2[P1-rw1077] and other p2 alleles suggest that this p2 sequence was introgressed together with P1-rw1077 into the 4Co63 background. This implies that the p2 part of P1-rw1077, which is p2[P1-rr4B2] -like, is derived from a p2 source other than p2[P1-rw1077]. The p2[P1-rw1077] 3' sequence is also flanked by Eninu and Ji retroelement fragments, linking p2[P1-rw1077] to P1-rw1077 across the retrotransposon cluster (Figure 3).
In summary, while p1 alleles can be located on both sides of the retroelement cluster (Figure 3), complete p2 alleles have so far only been found upstream of the retroelement cluster.
Because all known p1 alleles produce almost identical P1 proteins (Additional file 1: Supplemental Figure S1), differential expression of p1 alleles could have evolved through changes in regulatory sequences, which control time- and tissue-specific p1 expression. Sequences containing regulatory elements have only been determined for P1-rr [19, 20], but based on sequence similarities they likely have the same function in other p1 alleles as well. While all known p1 alleles share the P1-rr promoter and proximal enhancer sequences, they differ in the sequence arrangement that contains the distal P1-rr enhancer. Comparing putative distal enhancer regions of p1 alleles reveals that the single P1-wr[B73] gene carries the simplest and therefore possibly the most ancestral form, which is confirmed by the presence of an almost identical enhancer region at the 3' intergenic region of the p2 gene in a wild relative of maize (Teosinte accession Zea mays ssp. parviglumis). Complexity of this chromosomal region increased with P1-rw1077 and then P1-rr4B2. Therefore, we can use the changes in sequence organization to explain the origin of the P1-rw1077 and P1-rr4B2 enhancer region within the P1-wr repeat context, where the 3' end of one copy equals the 5' end of the downstream copy.
Whereas P1-rw1077 and P1-rr are identical in the initial 3' UTR, they diverge 35 bp following the stop codon. The next 13 bp (ATAATTGGGTCAC) in P1-rr originated from two separated P1-rw1077 sequences, 1,410 bp apart, implying a deletion event in P1-rr compared to P1-rw1077. The 13-bp (ATAATTGGGTCAC) insertion in P1-rr can be assigned to P1-rw1077 sequences upstream and downstream of the deletion site. ATAATTGGG is duplicated 59 bp downstream and includes the first two bp of the MULE TIR. Obviously, the adjacent TCAC occurs frequently within the P1-rw1077 sequence. However, the closest TCAC can be located 21 bp upstream of the insertion site. The 13-bp P1-rr sequence subsequent of the point of divergence with P1-rw1077 is suggestive of filler DNA, indicating that a previous DNA double-strand break in P1-rw1077 was restored by the NHEJ pathway. A tandem duplication of 1,269 bp that comprises the majority of both MULE fragments and 3' flanking enhancer sequences generated the current P1-rr 3' end and enhancer region.
In summary, DNA double-strand breaks in a P1-wr-like tandem array were probably repaired by NHEJ events that could have resulted in the rearrangements and duplications of enhancer-carrying sequences and consequently in novel p1 alleles as discussed below.
Although p1 and p2 share nearly the same coding sequences, their downstream sequences vary remarkably (Additional file 5 : Supplemental Figure S3). Sequence alignments of p1 and p2 alleles revealed that the P1-rr4B2 and p2 divergence from P1-wr[B73] is caused by transposon insertions. In P1-rr4B2, a Mu -like element was placed 109 bp downstream of the stop codon probably due to a deletion event (see above), and p2 alleles are followed by an Eninu retroelement 248 bp after the stop codon. The insertion sites close to the stop codon raise the question whether these transposable elements eliminated the transcription termination signals and the polyadenylation sites in the P1-rr4B2 and p2 3' UTRs. In general, the 3' UTR is also important for post-transcriptional regulation such as microRNAs and translational control, and gain or loss of cis elements within the 3' UTR could contribute to allelic diversity. Therefore, we decided to map the polyadenylation sites of these alleles.
Table 1. PCR primers used in this report to amplify p2, p1 and flanking sequences and for 3' RACE.
p2 3' RACE: p2 race 5'-1, p2 race 5'-2, p2 race 5'-3
p1 3' RACE: p1 race 5'-1, p1 race 5'-2, p1 race 5'-3
3' flanking sequences: gene 1-2 f, gene 1-2 r
RT-PCR results indicated that p2[4Co63], p2[P1-rr4B2] and p1/p2[B73] are expressed in silk tissue. Accordingly, we carried out 3' RACE experiments using total RNA from silk and three different primers (p2 race 5'-1 to 3, see Table 1), which hybridize to exon 3 of p2. The RNAs extracted from p2[4Co63], p2[P1-rr4B2] and p1/p2[B73] lines produced almost identical results with all PCR primers, which allows us to combine the data for ease of presentation (Figure 4B). We found 19 polyadenylation sites in a 218-bp interval that is located between 139 and 356 nt past the p2 stop codon. Whereas seven minor polyadenylation sites (adding up to 21% of the total events) are upstream of the retrotransposon, 12 sites lie within the LTR, including the major site (33% of polyadenylated p2 mRNAs), which is 269 nt from the stop codon and 22 nt into the LTR. The sequence alignment between p2 and P1-wr[B73] shows that the main polyadenylation site of P1-wr[B73] is 87 bp past the point of p2 and P1-wr[B73] divergence. The equivalent p2 fragment was displaced by retroelement insertions, and therefore cannot serve its original function. Nevertheless, p2 was able to recruit alternative polyadenylation signals and sites located mostly in the Eninu LTR.
Subsequently, we performed 3' RACE experiments on P1-rr4B2 total RNA extracted from silk using one primer (p2 race 5'-3, see Table 1) that binds to exon 3. This exon contains the 3' UTR of the alternatively spliced P1-rr4B2 transcript, which encodes the functional P protein. We sequenced significantly fewer clones compared to P1-wr[B73] and p2 and obtained fewer polyadenylation sites. While polyadenylation sites are distributed over 403 nt, from 143 to 545 nt measured from the stop codon, the first site is used most often (31%) (Figure 4C). All seven polyadenylation sites are located in the MULE fragments, two within the TIR and the remainder in the transcribed part. Due to the partial deletion of the former 3' UTR, alternative polyadenylation signals and sites had to be employed from adjacent sequences as described above. Note that the MULE borders P1-rr4B2 in opposite transcriptional orientation. A transcript from an intact member of the same MULE family could therefore produce antisense RNA that is complementary to P1-rr4B2 mRNA.
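The interval sizes quoted for the p2 and P1-rr4B2 polyadenylation sites follow from the reported coordinates if both end positions are counted. The short Python check below makes that assumption explicit; the function name `interval_length` is hypothetical.

```python
def interval_length(first_nt, last_nt):
    """Length of an interval whose two end positions are both counted."""
    return last_nt - first_nt + 1


# p2 alleles: sites between 139 and 356 nt past the stop codon.
p2_span = interval_length(139, 356)

# P1-rr4B2: sites from 143 to 545 nt past the stop codon.
p1_rr_span = interval_length(143, 545)

# p2 site counts: 7 upstream of the retrotransposon plus 12 within the LTR.
p2_sites = 7 + 12

print(p2_span, p1_rr_span, p2_sites)
```

Under this inclusive convention the spans come out as 218 and 403 nt and the p2 site count as 19, matching the figures given in the text.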
A distinguishing feature of the p locus is its tremendous allelic diversity, which makes it a preferable locus to study evolutionary changes and chromosomal dynamics on a larger and smaller scale. Although the grass family arose by an ancient whole genome duplication (WGD) event, the p gene has only a single ortholog in rice and sorghum, indicating that one copy was lost from the paleoploid ancestral genome. However, the more recent allotetraploidization event, which formed the ancestor of maize about 5 mya, gave rise to two p copies located in the homoeologous regions of chromosomes 9 and 1. The copy on chromosome 1 was then duplicated in tandem 2.75 mya, thereby evolving into the current p2 (ortholog) and p1 (paralog) genes [8, 10]. The bulk of retrotranspositions in most grasses occurred more recently. A series of nested insertions that split approximately 80 bp of the p 3' UTR occurred between 1.4 and 0.2 mya. Although retroelements are highly repetitive in the genome, insertions of retroelements in a nested fashion create unique sequence junctions and become chromosomal markers. However, we do not know whether the retroelements transposed into the paralog or ortholog repeat, or maybe even into a later-generated copy. A model proposed for the evolution of single-copy alleles states that the retroelement insertion occurred in the 3' UTR of p2, thereby separating p1, which turned into P1-rr and P1-rw [8, 17] (Figure 3). In contrast, transposition into the 3' UTR of p1 retains the repeat structure and allows the amplification of additional copies by unequal crossover as suggested for the evolution of the multi-copy P1-wr[B73] allele (Figure 3). Theoretically, only a few recombination events are needed to transfer p1 and p2 sequences across the retroelement cluster. Therefore multi-copy alleles in a tandem array could have been created upstream, downstream or on both sides of the cluster simultaneously.
These intermediate structures that enable us to discover the step-by-step evolution of all p alleles might still exist in the maize germplasm. Our current analysis allows us to present new and refined models for the evolution of p1-ww[4Co63], P1-rr4B2 and P1-rw1077.
p1-ww is a null allele because p1-specific sequences such as coding, promoter and proximal enhancer sequences are absent in 4Co63 (Figure 3). Is it possible that p1-ww[4Co63] represents a haplotype where the tandem duplication of the ancestral p gene never took place? According to this hypothesis, the nested retroelement insertions in 4Co63 that are identical to those in alleles containing p1 and p2 sequences must have happened before the p duplication event. However, since the p duplication occurred 2.75 mya, 1.37 million years before the first retroelement insertion, we can disregard this possibility [8, 10]. Thus, the alternative explanation that a functional p1 allele was deleted to give rise to p1-ww[4Co63] is more likely. The p1-ww[4Co63] structure does not reveal the functional p1 allele(s) and their deletion or recombination events that resulted in the current null allele. Considering that p1 alleles are located on both sides of the retroelement cluster, multiple recombination events could have occurred to create the p1-ww[4Co63] allele.
p1-ww[4Co63] also could have evolved by a recombination event that involved two different p1 alleles. Unequal crossing over between p2 of P1-rr4B2 and p1/p2[B73] of P1-wr[B73] could have generated the current p1-ww[4Co63] structure and could have restored the p2 copy (Figure 7B). Even then, the deletion of the original paralog would have been derived from the P1-wr[B73] allele. Nevertheless, this could not have happened recently (on an evolutionary time scale) because of sequence polymorphisms in the participating alleles. Interestingly, both p1-ww[4Co63] and P1-wr[B73] carry as a signature the Shadowspawn retroelement in the same position, indicating that p1-ww[4Co63] most likely derived from P1-wr[B73] in multiple steps.
In addition, we can envision a P1-rw -like allele, which is similar to P1-wr[B73] in the distal enhancer structure. Such a P1-rw allele has been described . An unequal crossover between the large repeats flanking the coding regions duplicates the p1 gene or deletes the coding sequences, resulting in the p1-ww[4Co63] structure (Figure 7C). This scenario resembles the origin of p1-ww1112 . However, this model does not directly account for the Shadowspawn retroelement in p1-ww[4Co63]. All models demonstrate the complexity of the p locus and reveal the countless possibilities for recombination to occur whenever paralogous sequences are present.
Despite the repeat structure of the P1-wr[B73] cluster, a single P1-wr[B73] copy has the least complicated p1 allele composition, followed by P1-rw1077 and then P1-rr4B2. We hypothesize that P1-rw1077 originated from a P1-wr-like tandem array (Figure 5) because P1-rw1077 comprises a sequence fragment downstream of the p2 section that is virtually identical with the junction sequence of two P1-wr[B73] repeats in a head-to-tail assembly. This P1-wr tandem array could have been located on either side of the retroelement cluster.
A plausible sequence of events is as follows. A Mu -like element inserted into one of the P1-wr repeats 1,204 bp after exon 3 of the previous copy. Then an aberrant transposition event (abortive excision event) of this MULE caused a DNA double-strand break that enabled exonucleases to digest the unprotected DNA ends (Figure 5) thereby extending the gap into the adjacent P1-wr repeat. The deletion would have included MULE sequences (about 3.5 kb compared to a putative autonomous element) and almost the entire length of a P1-wr repeat (more than 12 kb). Non-homologous end-joining [21–23, 26], copying a 9-bp sequence (AACCTATGT) that is located 27 bp downstream from the deletion endpoint, must have repaired the break. Due to the nature of tandem repeats, the large deletion described above results in small repeats of 203 bp that are flanking the MULE fragments. Interestingly, this duplication is part of a 1.2-kb sequence that contains the enhancer element of P1-rr.
A single P1-wr- like allele downstream of the retroelement cluster that is flanked by large repeats due to the retroelement insertion in p2 could have been converted into a tandem array by unequal crossover between the large repeats (Figure 7C). Gene conversion events then could have transferred the altered region that originated at the 3' large repeat to the 5' large repeat where the distal enhancer sequence functions . Alternatively, P1-rw1077 arose from P1-wr repeats upstream of the retrotransposon cluster. The sequence 3' of this cluster, which corresponds to the 3' intergenic region of p2 as found in the P1-wr[B73] cluster and p1-ww[4Co63], is nearly identical with the 5' end of a P1-wr[B73] repeat over a stretch of 5.2 kb (Figure 2 and Additional file 6 : Supplemental Figure S4A). Due to this sequence similarity, a recombination event between p1-ww[4Co63] and the proposed P1-rw1077 precursor could have occurred that positioned P1-rw1077 downstream of the retrotransposon cluster (Additional file 6 : Supplemental Figure S4A). This arrangement assumes that the P1-rw1077 allele resembles p1-ww[4Co63] at the 5' end. After the recombination break point, P1-rw1077 has to be closer to P1-wr[B73] because, based on our model, P1-rw1077 is derived from P1-wr. Indeed, a sufficient amount of polymorphisms between the p1-ww[4Co63] and P1-wr[B73] alleles enables us to verify the predicted structure and to place the possible recombination site between 567 and 713 bp after the point of p1-ww[4Co63] and P1-wr[B73] alignment. Further recombination/gene conversion events contributed to the evolution of the present P1-rw1077 allele.
The presence of the MULE fragments and filler DNA in P1-rr in exactly the same sequence context as in P1-rw1077 agrees with our model that P1-rr continued to evolve from P1-rw1077. In our model for the origin of P1-rr, we propose a second DNA double-strand break (DSB) that occurred in P1-rw1077 in between the stop codon and the MULE insertion (Figure 6). In contrast to the first DSB, there is no evidence for the participation of a TE, leaving the cause for the DSB unknown. Exonuclease activities expanded the gap until both ends were joined in a NHEJ fashion by synthesizing two short DNA pieces (filler DNAs) from sites close to the deletion end points into the repair site. The DSB repair caused a deletion of 1,410 bp across the repeat junction that spanned almost the entire sequence from the stop codon to the MULE fragments. Interestingly, this intermediate P1-rr structure can be found at the 3' end of P1-rr1088, P1-rrCFS36 and P1-rwCFS342 .
The 5' transposon fragment happened to contain an 8-bp sequence close to the TIR (55-62 bp) that is present 1,269 bp further downstream as well. Unequal crossover between those 8 bp resulted in a tandem direct duplication of this 1,269 bp sequence. Accordingly, the final 318 bp of exon three, being part of the repeat, were replicated, too. A sequence at the 3' end of the first repeat was adopted as a splice acceptor site thereby generating a fourth exon. Although alternative splicing of exon 1, 2 and 4 has been reported, the protein product is of unknown function or may not have any function at all .
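How a crossover between two short repeats can duplicate the sequence between them is easier to follow with a toy example. The Python sketch below is a simplified illustration, not a reconstruction of the actual locus: the 8-bp repeat `GATTACAG`, the flanking bases and the function name `unequal_crossover` are all hypothetical, and only the 1,269-bp spacing is taken from the text. Two identical copies of the sequence are assumed to pair out of register so that the downstream repeat of one aligns with the upstream repeat of the other.

```python
def unequal_crossover(seq, first, second, k):
    """Recombine two identical, misaligned copies of `seq` at a short repeat.

    first and second are the 0-based start positions of the two repeat
    copies; k is the repeat length. Returns the two reciprocal products
    as (duplication, deletion).
    """
    if seq[first:first + k] != seq[second:second + k]:
        raise ValueError("the two sites do not carry the same short repeat")
    # Copy A through its downstream repeat, then copy B after its upstream repeat.
    duplication = seq[:second + k] + seq[first + k:]
    # Reciprocal product: copy A through its upstream repeat, then copy B
    # after its downstream repeat.
    deletion = seq[:first + k] + seq[second + k:]
    return duplication, deletion


REPEAT = "GATTACAG"   # hypothetical 8-bp repeat; the real sequence is not given
SPACING = 1269        # distance between the two repeat copies, from the text
filler = ("CT" * SPACING)[:SPACING - len(REPEAT)]
parent = "A" * 50 + REPEAT + filler + REPEAT + "T" * 50
first = 50
second = first + SPACING

dup, dele = unequal_crossover(parent, first, second, len(REPEAT))
print(len(dup) - len(parent), len(parent) - len(dele))
```

One product gains a tandem copy of the 1,269-bp segment and the reciprocal product loses it, which is the same duplication-or-deletion outcome invoked for unequal crossover between the large repeats.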
This putative evolutionary pathway explains how the P1-wr 3' UTR was almost entirely replaced by a MULE, how the fourth exon unique to P1-rr was generated and how the 1,269-bp SalI fragment containing the P1-rr distal enhancer was nearly completely duplicated (the initial 175 bp of the enhancer region are missing from the first repeat). Subsequently, gene conversion events could have placed part of the modified enhancer sequence of the downstream copy in the upstream large repeat. Alternatively, if this P1-rr module arose on the P1-wr[B73] side of the retroelement cluster as we also discussed for P1-rw1077, then a recombination event between p1-ww[4Co63] and the P1-rr ancestor could have transferred the P1-rr end to a position downstream of the retrotransposon cluster (Additional file 6: Supplemental Figure S4B). The crossing over took place in the 595-bp sequence between the duplicated MULEs, which is why the repeat structure of P1-rr at the 5' end differs from the 3' end whereas they are identical in P1-rw1077. Lastly, a 1.6-kb hAT-like transposable element inserted 340 bp upstream of the MULE or 159 bp 5' of the enhancer region. This transposition did not occur in P1-rr1088. Taken together, the novel distal enhancer structure of P1-rr could be the result of a MULE insertion and excision, deletion and repair by NHEJ, and duplication and deletion by recombination. This series of events from P1-wr to P1-rr confirms the sequential model of P1-rw and P1-rr evolution based on phylogenetic analysis.
When the p1 paralog was formed, it probably included the complete p coding sequence and the basal promoter that controls p expression in silk tissue. Then the paralog acquired two additional regulatory sequences adding equally to the basal expression in pericarp and glume. The enhancer sequences were identified and tested in transient and transgenic plants using P1-rr fragments fused to a GUS reporter gene [19, 20]. A 1-kb sequence adjacent to the promoter contains a regulatory sequence termed proximal enhancer while a 1.2-kb fragment further upstream includes a distal enhancer (Figure 2). The proximal enhancer region corresponds mostly to a truncated MULE that captured part of a host gene in between the TIR. The proximal enhancer region and the basal promoter sequence are virtually identical in all p1 alleles sequenced to date (Figure 2). In contrast, the distal enhancer region varies in all p1 alleles as described above. Therefore, we hypothesize that the different spatial and temporal expression patterns of p1 alleles are caused by distinct distal enhancer regions. The distal enhancer as defined in P1-rr is located within a 1,269-bp SalI fragment [19, 20], out of which 671 bp are derived from the Mu-like transposon (Figure 2). Although this MULE fragment is missing in P1-wr[B73], transgenes constructed from P1-wr upstream regulatory sequences linked to P1-rr cDNA produced red pericarp and cob glumes in transgenic plants, indicating that the enhancer sequence is included in the 589-bp region downstream of the MULE. Since this 589-bp region is duplicated in P1-rr, P1-rr has two enhancer sites that are separated by the MULE fragment. Additional P1-rr alleles, namely P1-rr1088 and P1-rrCFS36, were shown to have the same enhancer structure as P1-rr4B2 with the exception of the missing hAT insertion in P1-rr1088.
Therefore, the hAT transposable elements inserted in the upstream copy of the enhancer region of P1-rr4B2 and P1-rrCFS36 evidently do not disrupt the enhancer sequence or its function. Compared to P1-rr, P1-rw1077 has a deletion of 381 bp in the upstream repeat, which causes the loss of cob glume pigmentation. Interestingly, two additional P1-rw alleles, P1-rwCFS302 and P1-rwCFS342, lack the entire upstream repeat and the MULE fragment and thus have the same enhancer arrangement as a single P1-wr[B73] copy. Taken together, the analysis of three P1-rr and three P1-rw alleles revealed that P1-rr alleles contain two copies of the specific enhancer sequence, whereas P1-rw alleles have only one. Interestingly, this region coincides with a tissue-specific DNase I-hypersensitive site that remains closed in pericarp tissue of P1-pr, a silenced epiallele of P1-rr4B; the P1-pr phenotype is shown in Figure 1A. It was reasoned that the upstream enhancer repeat that is missing in P1-rw1077 controls glume-specific expression in a position-dependent manner, since the identical enhancer region is located 671 bp further downstream. An alternative explanation was prompted by the fact that p1 expression in pericarp is weaker and delayed in P1-rw1077 compared to P1-rr. We hypothesize that the transcriptional strength of p1 alleles is correlated with the enhancer copy number, which is supported by similar findings for human upstream enhancers. Consequently, P1-rw1077 produces less P1 protein than P1-rr in all expressing tissues. Moreover, a given p1 allele is not expressed uniformly across female and male floral tissues within a plant. For example, in P1-rr, p1 transcription is usually higher in pericarp than in cob glumes.
Therefore, we propose that the presence of only one distal enhancer site in the P1-rw1077 allele results in weak expression in pericarp tissue but no expression in cob glumes. Due to the duplication of the enhancer sequence as outlined in our model, p1 transcription in pericarp and glume tissue was equally elevated such that p1 is strongly expressed in pericarp and weakly expressed in glumes, thereby giving rise to P1-rr alleles. Note that comparisons with P1-wr alleles are not appropriate because of their post-transcriptional silencing, which is potentially repeat-induced. This model is supported by an analysis of the spatial expression pattern in transgenic plants in which various p1 constructs were expressed in only a few p-expressing tissues, resembling P1-rr or P1-rw phenotypes. It was shown in these transgenic plants that p1 expression follows a spatial hierarchy that begins with pericarp and continues with cob glumes, husk, silk, and tassel glumes in decreasing order [32, 33]. For instance, if a transgene was expressed in only one tissue, that tissue was pericarp; if it was expressed in two tissues, they were pericarp and glumes; and so on.
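The spatial hierarchy described above can be restated as a simple lookup. The following is a minimal illustrative sketch, not part of the original analysis; the names P1_TISSUE_HIERARCHY and expressing_tissues are hypothetical, and the only information encoded is the tissue order given in the text.

```python
# Tissue order reported for p1 transgene expression (first = most likely to express).
# The list name and the function below are hypothetical helpers for illustration.
P1_TISSUE_HIERARCHY = ["pericarp", "cob glumes", "husk", "silk", "tassel glumes"]


def expressing_tissues(n_tissues):
    """Return the tissues expected to express p1 when exactly n_tissues do,
    assuming expression strictly follows the reported hierarchy."""
    if not 0 <= n_tissues <= len(P1_TISSUE_HIERARCHY):
        raise ValueError("n_tissues must be between 0 and 5")
    return P1_TISSUE_HIERARCHY[:n_tissues]


print(expressing_tissues(1))  # one tissue: pericarp only
print(expressing_tissues(2))  # two tissues: pericarp and cob glumes
```

Under this reading, a P1-rw-like pattern corresponds to one expressing tissue and a P1-rr-like pattern to at least two.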
Polyadenylation is involved in many facets of mRNA metabolism, including enhancement of mRNA stability, transport of mRNA from the nucleus into the cytoplasm, and regulation of mRNA translation. Although polyadenylation signals in plants are less conserved than in mammals, three signals were identified in maize, rice, and Arabidopsis: the far upstream element (FUE, located at -150 to -35 nt relative to the cleavage site), the near upstream element (NUE, situated at -35 to -10 nt relative to the cleavage site) and the cleavage element (CE, positioned at -10 to +15 nt, that is, upstream and downstream of the cleavage site) [35, 36]. As we have shown above, a fragmented MULE was placed adjacent to the P1-rr4B2 stop codon, possibly due to an NHEJ event. All mapped polyadenylation sites of the P1-rr4B2 transcript are located within the MULE sequence, indicating that P1-rr4B2 successfully recruited alternative polyadenylation signals in the transposon. Similarly, a Mu insertion in the 3' UTR of the rf2a locus also resulted in the adoption of new polyadenylation signals and sites. Retroelements, the most common transposons in maize, also insert in 3' UTRs without disrupting polyadenylation, as demonstrated above for the p2 alleles. Our results suggest that polyadenylation in maize is a highly dynamic process which, despite its importance for the cell, is not tightly regulated. The large number of polyadenylation sites found in our analysis of P1-wr[B73] transcripts that do not contain a transposon insertion supports this conclusion. A genome-wide analysis of genomic and transcript data could shed light on the mechanism of polyadenylation in maize and could establish the proportion of genes that terminate in transposable elements. Interestingly, it has been shown that many polyadenylation signals in human and mouse genes have been derived from transposable elements.
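The three signal windows defined above (FUE, NUE and CE) can be expressed as coordinate ranges relative to the cleavage site. The sketch below is only an illustration of those ranges; the function name classify_polya_element is hypothetical, negative offsets denote upstream positions, and the windows are treated as inclusive, so the boundary positions -35 and -10 fall into two windows each.

```python
# Windows (in nt relative to the cleavage site at 0) as given in the text.
# Treating both ends as inclusive is an assumption made for this sketch.
POLYA_WINDOWS = {
    "FUE": (-150, -35),  # far upstream element
    "NUE": (-35, -10),   # near upstream element
    "CE": (-10, 15),     # cleavage element
}


def classify_polya_element(offset):
    """Return the names of all signal windows that contain the given offset."""
    return [name for name, (start, end) in POLYA_WINDOWS.items()
            if start <= offset <= end]


print(classify_polya_element(-100))  # falls in the FUE window
print(classify_polya_element(-20))   # falls in the NUE window
print(classify_polya_element(5))     # falls in the CE window
```

Returning a list rather than a single label keeps the overlap at the shared boundaries explicit instead of resolving it arbitrarily.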
Besides polyadenylation signals, transcriptional as well as translational regulators have been identified in the 3' UTR of plant and animal genes, and their gain or loss could cause allelic diversity. For example, targets of microRNAs are often located in 3' UTRs [38, 39].
Recombination is crucial for the evolution of genomes [40, 41]. In particular, the non-homologous recombination pathway is frequently used to repair DNA double-strand breaks in somatic plant cells. Previously, we reported a probable NHEJ event involved in the formation of the P1-wr[B73] cluster that produced a hybrid gene through the ligation of deletion end points located within two genes. Similarly, deletions and repair by NHEJ in the above-mentioned alleles could have resulted in the restructuring of an enhancer region and the formation of a novel 3' UTR.
The exceptional allelic variation at the p locus raises the question of how it resembles and differs from genes that exhibit less variation. We propose that the main cause of the diversity might lie in tandem gene amplification [8, 17, 42, 43]. Once a gene has undergone an initial tandem duplication, multiple unequal recombination events can follow, as seen in the P1-wr[B73] multi-gene cluster. A single crossover or gene conversion event between misaligned paralogous gene copies can generate many new alleles, including deletion and amplification derivatives. Interestingly, in plants such events can occur mitotically and can be transmitted to the next generation, thereby increasing allelic variation. This explanation implies that other loci exhibiting increased allelic variation are multi-copy genes as well. Indeed, the complex r1 locus in maize is analogous to p1 in many aspects. The r1 locus, which also encodes a transcription factor, confers bluish anthocyanin pigmentation to various vegetative and floral plant tissues. Two r1 alleles, R-st and R-r, are molecularly well characterized. R-st contains various r1 genes, four of which are in tandem orientation. R-r consists of one complete and three truncated r1 genes that originated from tandem duplication [46, 47]. Comparable to p1 in complexity, both alleles undergo recombination and transposition events that create numerous derivative alleles. Paralogous gene copies in maize were also found at the pl1 and a1 loci. The prolamine gene family, with nearly 50 copies distributed over several chromosomes, especially exemplifies the outcome of gene duplications. In fact, a large proportion of genes are tandemly duplicated in Arabidopsis, rice, and maize [51–53]. Considering the number of paralogous sequences and their possibilities to recombine, a single reference genome providing just one allele cannot reflect this allelic potential of the maize genome.
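How a single crossover between misaligned tandem copies yields both a deletion and an amplification derivative can be shown with a toy model. The sketch below is a deliberate simplification and makes several assumptions: each chromosome is a plain list of gene-copy labels, the exchange happens inside the paired copies so that a hybrid copy is formed, and the function name unequal_crossover and the copy labels are hypothetical.

```python
def unequal_crossover(chrom_a, chrom_b, i, j):
    """Toy model: copy i of chrom_a pairs with copy j of chrom_b and a single
    crossover occurs inside the paired copies. Returns both reciprocal products."""
    hybrid_ab = f"{chrom_a[i]}/{chrom_b[j]}"  # 5' part from A, 3' part from B
    hybrid_ba = f"{chrom_b[j]}/{chrom_a[i]}"  # 5' part from B, 3' part from A
    product_1 = chrom_a[:i] + [hybrid_ab] + chrom_b[j + 1:]
    product_2 = chrom_b[:j] + [hybrid_ba] + chrom_a[i + 1:]
    return product_1, product_2


# Hypothetical three-copy tandem cluster on both homologs.
cluster = ["copy1", "copy2", "copy3"]

# Misalignment: copy1 on one homolog pairs with copy2 on the other.
deletion, amplification = unequal_crossover(cluster, cluster, 0, 1)
print(len(deletion), deletion)            # two copies remain
print(len(amplification), amplification)  # four copies are present
```

With n copies on each homolog, pairing copy i with copy j gives products of n + i - j and n + j - i copies, so every misalignment produces one shortened and one expanded cluster, each carrying a hybrid copy at the exchange point.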
Not surprisingly, a recent genomic comparison between the B73 and Mo17 inbred lines revealed a large number of copy number variations and presence/absence variations, confirming previous results. Nonetheless, epialleles remain invisible in a traditional sequence comparison. Allelic diversity studies as presented here are essential for our understanding of the remarkably dynamic maize genome.
Allelic diversity is the source of evolution and domestication. While allelic variation in wild species ensures the best possible adaptation to changing environmental conditions, humans have profited from allelic pools in crop plants by selecting phenotypic variations that best meet their needs. Alleles differ most often in small-scale nucleotide polymorphisms but also in large-scale sequence rearrangements. Maize has been shown to be a highly polymorphic species well suited to the study of genome dynamics and the underlying molecular mechanisms. In particular, the maize p locus, with its well-established genetic history, offers a large number of ancient allelic variants, some representing intermediate steps in large-scale sequence rearrangements. The tandemly duplicated p1 and p2 genes encode virtually identical Myb-like transcriptional activators, but only p1 controls the accumulation of reddish flavonoid pigments in maize female and male floral organs. Because all P1 proteins are almost identical, the phenotypic variation must be due to p1 regulation. Therefore, this locus represents an ideal example of how genomic rearrangements can contribute to novel regulatory elements.
Here, we used targeted genome sequencing to apply comparative genomics to the maize genome. Sequence alignments of orthologs and paralogs of different genotypes of a single genomic region allow us to reconstruct the repair of double strand breaks from transposition events within gene copies and their flanking regions. Such drastic invasions of new sequence elements in flanking regions result in the de novo creation of regulatory elements involved in the transcriptional and post-transcriptional regulation of gene expression that differentiate gene copies in their function. Interestingly, sequence chimerism in the 3' untranslated portion of the mRNA gave rise to multiple poly-A addition signals with similar strength, indicating a more relaxed sequence restriction of the 3' processing machinery than previously believed.
Seeds containing P1-rr4B2 and P1-rw1077 alleles, which were introgressed into a 4Co63 background, were kindly provided by Tom Peterson, Iowa State University. The inbred lines B73 and 4Co63, carrying P1-wr and p1-ww alleles, respectively, were obtained from the Maize Genetics Cooperation Stock Center (maizecoop.cropsci.uiuc.edu) collection. Traditionally, p1 alleles are classified and named according to their pericarp and cob glume pigmentation, which implies that phenotypically similar but structurally different alleles share the same name. In this report, we add the inbred line in which the p1 allele was originally described as an additional allelic designation, such as P1-wr[B73] and p1-ww[4Co63]. Similarly, the inbred line is used as the allele description for p2, for example p2[4Co63]. Whenever the p2 source is unknown, the name of the linked p1 allele is added to p2, such as p2[P1-rr4B2] and p2[P1-rw1077].
The inbred line 4Co63 contains a p1-ww allele, according to the colorless pericarp and cob phenotype of 4Co63 ears. We constructed a size-restricted lambda library using a lambda DASH II/EcoRI vector kit (Agilent Technologies) and EcoRI-digested 4Co63 genomic DNA. The lambda library was screened by hybridizing filters with probe 15, which is derived from a distal enhancer fragment of P1-rr and is unique to p1 alleles. Two positively hybridizing lambda clones were isolated and subcloned into pBluescript II SK+ vectors (Agilent Technologies). Insert size and both end sequences of each clone were determined and found to be identical. A transposon minilibrary (Finnzymes) of one clone was constructed according to the manufacturer's instructions. Sequencing was performed with the ABI PRISM BigDye Terminator Cycle Sequencing Ready Reaction kit and an ABI 3730 capillary sequencer (Applied BioSystems). Sequence assembly and analysis were carried out using Lasergene (DNAstar) programs. Sequence gaps were closed by primer walking.
Genomic PCR was performed to amplify p2 alleles. PCR primers (see Table 1 and Figure 1B) were designed based on p2 sequences from p2[p1-ww1112], p2/p1[B73] and p1/p2[B73]. The PCR-amplified products were cloned into the pGEM-T Easy vector (Promega). The individual clones were completely sequenced using primers spanning the entire repeat length (approximately one primer every 300 bp; primer sequences available upon request). The sequencing reactions were carried out with the ABI PRISM BigDye Terminator Cycle Sequencing Ready Reaction kit and analyzed on an ABI 3730 capillary sequencer (Applied BioSystems). The sequences were assembled and evaluated with the Lasergene software (DNAstar).
The majority of the P1-rr sequence was determined in P1-ovov1114 (orange variegated pericarp and cob), which is derived from P1-vv. The Ac element of P1-vv, located in the second intron, excised and reinserted 161 bp further upstream in the opposite direction, still allowing a considerable amount of phlobaphene accumulation in pericarp and cob. Similarly, P1-rr4B2 is a P1-rr revertant that also originated from P1-vv by Ac excision. Unless otherwise specified, we use P1-rr (without additional allele designation) to refer to functional P1-rr alleles that are derived from the same P1-vv. Two EcoRI fragments, isolated from P1-ovov1114, were cloned in lambda using two EcoRI recognition sites outside of P1-ovov1114. The third site was provided by the Ac transposon. The 3' fragment of 14.5 kb was further divided into two plasmids, SA206 and PA103, which we gratefully received from Thomas Peterson. A transposon minilibrary of both plasmids (Finnzymes) was constructed as per the manufacturer's protocol. Clones were sequenced using transposon primers, ABI 3730 capillary sequencers, and the ABI PRISM BigDye Terminator Cycle Sequencing Ready Reaction kit (Applied BioSystems). Both plasmids contain 12,418 bp of non-overlapping P1-rr and 3' flanking sequences.
Genomic PCR was performed to amplify the p1 intergenic region and flanking genes. PCR primers (see Table 1) were designed based on corresponding sequences from B73 and P1-ovov1114 (this study). The PCR products were cloned, sequenced and analyzed as described above for p2.
Total RNA was extracted from pericarp tissue 20 days after pollination and from emerging silk with the RNeasy Plant Mini Kit (Qiagen). RNA was reverse-transcribed to cDNA using the GeneRacer Kit (Invitrogen) with the GeneRacer oligo(dT) primers. cDNA was PCR-amplified with the GeneRacer 3' primer and a gene-specific primer (see Table 1). In general, 96 RT-PCR products per primer pair (but only 18 for P1-rr4B2 samples) were cloned into the pGEM-T Easy vector (Promega) and sequenced with universal primers. DNA sequences were analyzed with Lasergene (DNAstar) software. Polyadenylation sites were plotted in Figure 6A to 6C only when they occurred more than once.
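The plotting criterion stated above (a site is shown only if it was observed more than once among the sequenced clones) amounts to a simple count filter. The sketch below illustrates that criterion only; it is not the procedure used for Figure 6, the function name recurrent_polya_sites is hypothetical, and the example positions are invented placeholders rather than measured data.

```python
from collections import Counter


def recurrent_polya_sites(site_positions):
    """Count how often each polyadenylation site was observed and keep
    only the sites that occurred more than once, sorted by position."""
    counts = Counter(site_positions)
    return {pos: n for pos, n in sorted(counts.items()) if n > 1}


# Invented placeholder positions (one entry per sequenced clone), not real data.
example_sites = [412, 412, 530, 611, 611, 611, 702]
print(recurrent_polya_sites(example_sites))  # {412: 2, 611: 3}
```

Sites seen in a single clone (530 and 702 in the placeholder list) are dropped, which reduces the influence of sporadic priming or cloning artifacts on the plotted distribution.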
The maize sequences were manually annotated using homology searches in various GenBank databases with multiple BLAST programs. The sequences were submitted to GenBank and were assigned the following accession numbers: p2[4Co63]: HM454271, p2[P1-rr4B2]: HM454272, p2[P1-rw1077]: HM454273, p1-ww[4Co63]: HM454274, p1-ww[4Co63] 3' flanking region: HM454275, P1-ovov1114 3' end: HM454276.
We thank Hugo Dooner for critical reading of the manuscript. We are grateful to Thomas Peterson for kindly providing SA206 and PA103 plasmids and P1-rw1077 seeds. This work was supported by the Selman A. Waksman Chair in Molecular Genetics to JM.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.