Multi-species sequence comparison reveals dynamic evolution of the elastin gene that has involved purifying selection and lineage-specific insertions/deletions

Background The elastin gene (ELN) is implicated as a factor in both supravalvular aortic stenosis (SVAS) and Williams Beuren Syndrome (WBS), two diseases involving pronounced complications in mental or physical development. Although the complete spectrum of functional roles of the processed gene product remains to be established, these roles are inferred to be analogous in human and mouse. This view is supported by genomic sequence comparison, in which there are no large-scale differences in the ~1.8 Mb sequence block encompassing the common region deleted in WBS, with the exception of an overall reversed physical orientation between human and mouse. Results Conserved synteny around ELN does not translate to a high level of conservation in the gene itself. In fact, ELN orthologs in mammals show more sequence divergence than expected for a gene with a critical role in development. The pattern of divergence is non-conventional due to an unusually high ratio of gaps to substitutions. Specifically, multi-sequence alignments of eight mammalian sequences reveal numerous non-aligning regions caused by species-specific insertions and deletions, in spite of the fact that the vast majority of aligning sites appear to be conserved and undergoing purifying selection. Conclusions The pattern of lineage-specific, in-frame insertions/deletions in the coding exons of ELN orthologous genes is unusual and has led to unique features of the gene in each lineage. These differences may indicate that the gene has a slightly different functional mechanism in mammalian lineages, or that the corresponding regions are functionally inert. Identified regions that undergo purifying selection reflect a functional importance associated with evolutionary pressure to retain those features.


Background
As part of the broader NISC Comparative Sequencing Program (see http://www.nisc.nih.gov), we generated sequences of the genomic region encompassing the locus commonly deleted in WBS in multiple mammalian species. The resulting multiple-sequence alignments provide the opportunity to examine properties of orthologous genes (such as ELN) that might clarify a functional role or contribution to human disease.
Elastin is a highly hydrophobic protein that plays a major role in providing elastic recoil to the dermis, lungs, and blood vessels. Two types of alternating domains are characterized: (1) hydrophobic (rich in glycine, valine, and proline) and (2) crosslinking (rich in alanine and lysine) [1,2]. Both types of domains are encoded in distinct, usually alternating vertebrate exons, which may be subject to alternative splicing [3][4][5][6][7].
The single-copy ELN gene is located in a region of conserved synteny between human 7q11.23 and mouse chromosome 5G. This region is referred to as the Williams region [5,8] and encompasses roughly 30 genes. Genomic sequences of additional species such as baboon, cow, cat, and rat show conserved synteny around ELN (Pam Thomas, personal communication), thus indicating that the gross anatomy of the region does not contribute to any anticipated changes in ELN function between species. Moreover, when alignments are viewed at the nucleotide level, most coding exons in the large region display contiguous columns of nucleotides that are classified as a match or mismatch, but are devoid of gapped positions. Contrary to this common characteristic, two of the WBS genes, ELN [9] and WBSCR15 [10], show strikingly less sequence conservation, both at the nucleotide and amino acid levels. Differences in the patterns of their divergence have not been previously analyzed.
A second, non-conventional feature of ELN is found in its structure. The exon/intron ratio is unusually low, reflecting small exons interspersed within large introns [3]. This feature represents a contradiction to the observed correlation between GC-rich genomic sequences and short intron length [11].
Mutations of ELN are implicated in several human disorders, including supravalvular aortic stenosis (SVAS), Williams syndrome [12], and cutis laxa [13,14]. Disease phenotypes are associated with multiple types of sequence changes including large deletions, translocations, nonsense-, frameshift-, and splice-site mutations [15][16][17][18]. Previous studies that examined limited alignments in the 3' end of the ELN locus in five species noted particularly dense accumulations of repetitive sequences immediately downstream of the gene [4,5] that contribute to rearrangements in the primate lineage.
An analysis of multiple sequence alignments of the ELN gene from up to 9 species reveals the nucleotide substitution patterns at synonymous and non-synonymous sites. This approach identifies the presence of numerous in-frame insertions/deletions (in/dels) within coding exons that have lineage-specific characteristics but do not diminish the overall hydrophobic character of the protein. Contrary to the expectation that genes with such a high level of nucleotide divergence would be undergoing positive selection, the remaining bases are highly conserved and are undergoing purifying selection. These results have implications for the use of animal models with ELN mutations as well as for studies of genomic evolution.

Gene structure
The expectation that orthologous coding exons in the 1.6-Mb genomic region commonly deleted in Williams Syndrome [9] align as highly conserved (gap-free) sequences among mammalian species is confirmed by the RFC2, LIMK1, and BAZ1B genes. ELN, however, is one example of a gene with a critical role in development that differs among its orthologs. Since the functional mechanism of the ELN protein is not fully elucidated, we sought to discern the nature of the divergence of this unusual gene. Although maintaining an overall interspersed structure of hydrophobic and cross-linking domains in all species, divergence occurs in the form of gaps in aligned coding exons and species-specific exons. The gapped alignment columns reveal a range of exon sizes among species (see Fig. 1 and supplemental Table A). Furthermore cow, pig, cat, and dog ELN genes have 36 exons, whereas mouse and rat ELN genes have 37 exons, due to an additional exon inserted after exon 4 (i.e., 4A). Some of these changes are recent; for instance the loss of two exons in human ELN (giving it 34 total) [19].
Alignment of human and mouse ELN cDNA sequences reveals 64.5% identity at the nucleotide level and 64.1% identity and 72.6% similarity at the amino acid level (with about 20% gaps; Table 2). This is well below the average percent identity of 85% at the nucleotide level and 78.5% at the amino acid level for human and mouse cDNAs [20] and is more similar to the average of 69% identity found in intronic regions. Although such divergence is typical of genes undergoing positive selection, these values are far below those of other genes in the region, with the exception of WBSCR15/Wbscr15 [10]. Rat and mouse ELN proteins share the most similarity, at 91%, consistent with the average percent identity of 92.6% calculated for 11,071 known mouse and rat cDNA orthologs.
The lower level of sequence conservation among ELN orthologs is also associated with differences in gene structure among species. Although the splice junctions in the human [16] and other mammalian ELN genes in all cases contain consensus GT/AG signals, they vary in their positions in the different orthologs. This leads to differences in the sizes of orthologous exons. For example, there are six ELN splice junctions that are not shared between mouse and human based on the whole-genome blastz alignments ([21]; shown in Fig. 2). These splice junctions are either altered by a nucleotide change, fall in a region of sequence expansion/contraction, or are not represented in the alignment because the exons failed to align. A closer look at these exons identifies a lineage-specific deletion in human ELN (Fig. 2, top panel) that corresponds to mouse exon 26 (at the 5' end of the exon). Internal deletions of varying sizes are illustrated in Fig. 2, bottom panel. Additionally, the use of alternative splice junctions can contribute to variable exon size; for instance, an alternate 5' splice junction in human is also completely conserved in baboon (data not shown).

Figure 1
Multi-species alignment of ELN proteins from eight mammalian species. Amino acids with similar chemical characteristics are color-coded. Human, cow, mouse, and rat are derived from GenBank sequences; baboon, cat, dog, and pig are predicted from genomic sequences based on the similarity to human and mouse ELN genes. Color legend: H, K, R - polar/positively charged amino acids; D, E - polar/negatively charged; N, Q - polar/amide; S, T - polar/alcohol; L, I, V - non-polar/aliphatic; F, Y, W - non-polar/aromatic; A, G, P - other non-polar. Domain information is shown below the alignment; alternating cross-linking (white boxes) and hydrophobic (yellow boxes) domains are shown. Exon borders are marked with black arrows at the top. Grey arrows mark the beginning of exons 4A (found in rodents) and 26A (human-specific, [4]), respectively.
The splicing pattern in human ELN uses a uniform phase 1 intron selection [16]. Here, the splice acceptor site falls after the first nucleotide of the last codon in an exon and the splice donor site falls immediately prior to the last two nucleotides of the interrupted codon. Interestingly, this pattern is also found in the ELN gene of seven other mammals, while being underrepresented in most other genes. Examination of 35,657 genes from the known gene track of the UCSC Genome Browser [22] shows that introns end in a phase 0, 1, and 2 fashion 43%, 32%, and 24% of the time, respectively; these figures are fairly consistent with previous studies done on fewer genes [23,24]. The majority of genes, including many within the Williams region, use a mixture of these splice options; however, a minority of genes (2,422 or 6%) contains a uniform splicing pattern throughout. Of these, 42% are phase 0, 46% are phase 1, and 0.8% are phase 2. Such uniform phasing may increase the transcript diversity of a gene by providing options that create maximal flexibility in the alternative use of exons without disrupting the reading frame.
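The phase bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure applied to the UCSC known gene track; the function names and the toy exon lengths are hypothetical, and the sketch assumes that the input lists only the coding portion of each exon, in transcript order.

```python
from itertools import accumulate


def intron_phases(cds_exon_lengths):
    """Return the phase (0, 1, or 2) of each intron in a transcript.

    The phase of the intron that follows an exon is the number of coding
    nucleotides accumulated so far, modulo 3. Phase 1 means the intron
    interrupts a codon after its first nucleotide.
    """
    cumulative = list(accumulate(cds_exon_lengths))
    # The last exon is not followed by an intron, so drop its total.
    return [total % 3 for total in cumulative[:-1]]


def has_uniform_phase(phases):
    """True if the transcript has at least one intron and all share one phase."""
    return len(phases) > 0 and len(set(phases)) == 1


# Hypothetical coding-exon lengths (not taken from ELN).
toy_uniform = [82, 30, 33, 60, 47]   # first exon = 1 mod 3, internal exons = 0 mod 3
toy_mixed = [90, 31, 44, 20]

print(intron_phases(toy_uniform), has_uniform_phase(intron_phases(toy_uniform)))
print(intron_phases(toy_mixed), has_uniform_phase(intron_phases(toy_mixed)))
```

In the uniform toy transcript every internal exon has a length divisible by 3, which is why skipping or including any internal exon leaves the downstream reading frame intact.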

Sequence divergence among species
The numbers of synonymous and non-synonymous substitutions per synonymous and non-synonymous site, respectively, were computed across the ELN gene using the Li-Wu-Luo method [25]. The complete-deletion option was used; thus, only those codons shared among all species examined were included in the analyses. While the above findings suggested that ELN is under positive selection, studies of nucleotide divergence suggest otherwise, i.e., that ELN is evolving under strong purifying selection, as indicated by Ks being much higher than the corresponding Ka values (Tables 3 and 4). Close inspection of the pair-wise and multi-species protein alignments [...]
The majority of repeat elements in the elastin protein contain variations of PGVA or AAAAYKAA amino acid motifs that appear in the hydrophobic and crosslinking domains, respectively. There are many combinations of separated, tandem, and periodic repeats, present in lengths that vary from 5 to 41 amino acids; these data indicate that no particular repeat consensus is maintained. Furthermore, in contrast to deletions in the coding regions of the SCA and PRNP genes that align perfectly between human and mouse [26], the repeat elements in the elastin gene differ among species.
[...] within the same species. This suggests that these insertion events did not result from the expansion of trinucleotide repeats and/or that mutation and repair mechanisms are not fully balanced. Furthermore, the inserted codons in closely related species show convergent differences at synonymous positions. For example, exon 6 of mouse, rat, cow, and pig has an alanine insertion, yet the mouse and pig genes share a C at the third codon position, while the rat and cow genes share a T. The same site in dog and primates lacks this insertion. Similarly, in exon 8, the sequence from cow has adjacent GAGV-repeats; yet, at the nucleotide level, this segment exhibits three synonymous differences at the third codon positions.
One possible scenario that led to these changes is shown in Figure 3D.
Thus none of the segments that contain repetitive variations of only a few amino acids (e.g., PGVA or AAAAYKAA motifs) appear to be the result of internal duplications.
Both types of domains contain a large amount of hydrophobic amino acids (hydrophobic regions are glycine-rich and crosslinking regions are alanine-rich), and have minor differences in the amino acid content among species (Figure 4). In particular, human and baboon have an excess of valine in their hydrophobic regions, whereas bovine has an excess of both valine and proline.

Figure 3
Amino acid and nucleotide sequence view of two insertion regions shared between several mammalian species. Panel A corresponds to an A-insertion found in exon 6 of several species. Panel B corresponds to the GAGV-repeat found in exon 8. Panel C illustrates the synonymous nucleotide differences found in the GAGV-repeat in the cow exon 8. Dots indicate identity with the top sequence. Corresponding amino acids are shown in brackets []. Panel D shows an example of an evolutionary scenario that describes the partitions found in exon 8. Tree topology was reconstructed using the neighbor-joining method [41] with Jukes-Cantor distance [42], and rooted with rodent sequences. Numbers at the nodes are bootstrap values.

Genomic features
Several features of the ELN locus make it noteworthy. The genomic interval encompassing the gene has a high GC content (56%), which is notably higher than the human genome average of 41%. Similarly high GC content is seen for the orthologs in seven other vertebrates. The lowest GC content (in rodents, 50%) is still significantly higher than the mouse genome average of 42% [20]. Interestingly, the regions immediately flanking ELN show extreme dips in GC content. For example, in mouse, a region of 42% GC is present 65 kb upstream of ELN, between two uncharacterized genes (matching pig_EST_BI340999 and mouse XM_132419.1). In human, GC content dips to 39.9% at 60 kb upstream of ELN. The regions of low GC content are also found to be rich in SINEs, LINEs, LTRs, and simple repeats (data not shown). Not surprisingly, several of the assembled sequences have contig gaps near the repeat insertions.
Ancestral repeats (ancestral repeat families are described in [20]) account for only 7% of the genomic interval containing the human ELN gene, whereas lineage-specific repeats account for 22% of interspersed repeat nucleotides. Typically, SINEs are strongly biased toward regions of high GC content, as observed in the ELN locus. However, the relative abundance of SINEs in the human (29%) and mouse (25%) ELN genes is higher than the genome averages (13.6% and 8.2%, respectively; [20]). The LTR content is highest in rat and mouse (although slightly lower than the mouse genome average), suggesting that these species have undergone the most insertion events.
Small-scale deletion rates calculated from three-way alignments of rat, mouse, and human show that rodents have fewer deletions (on the order of 10 bp) in the ELN locus than in the larger 1 Mb neighborhood (3.26 events per nucleotide for rat and 3.59 for mouse, versus 3.97 and 4.40, respectively [27]). Insertion rates, in contrast, are higher in the rat ELN locus (3.34) than in the 1 Mb neighborhood [...]

Figure 4
Compositional differences between hydrophobic (H) and crosslinking (CL) regions of ELN genes among eight mammalian species (H aa - CL aa, %).
Using a gap size of 20 bp or less, we computed ratios of gap columns per coding nucleotide of the gene in pairwise human-mouse blastz alignments. Compared with the overall average ratio of 0.0218 gaps per coding nucleotide observed for human chromosome 7 genes, the ELN gene exhibited a striking ratio of 0.737 (species-specific exons were excluded). This observation is consistent with numerous small-scale gaps observed at the protein level in multi-species alignments (Figures 1 and 3). A similar analysis in WBSCR5 shows a ratio of 0.063, in which one insertion of SPGPV hydrophobic residues is found in the human protein, while the majority of the remaining gaps fall within one contiguous area where the mouse WBSCR5 gene is missing sequence orthologous to human exon 10.
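The gap-to-coding-nucleotide ratio can be illustrated with a short Python sketch. It is not the script used for the blastz alignments: the function name and toy sequences are hypothetical, and it assumes that the two input strings are the coding portion of a pairwise alignment, that gaps are written as "-", and that the denominator is the number of ungapped bases in the reference (human) row.

```python
import re


def gap_ratio(ref_aln, other_aln, max_gap=20):
    """Gap columns per coding nucleotide in a pairwise alignment.

    Only gap runs of at most `max_gap` columns are counted, in either row.
    The denominator is the number of non-gap bases in the reference row
    (an assumption made for this sketch).
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned rows must have the same length")

    gap_columns = 0
    for row in (ref_aln, other_aln):
        for run in re.finditer(r"-+", row):
            run_length = run.end() - run.start()
            if run_length <= max_gap:
                gap_columns += run_length

    coding_nt = sum(1 for base in ref_aln if base != "-")
    if coding_nt == 0:
        raise ValueError("reference row contains no nucleotides")
    return gap_columns / coding_nt


# Toy alignment (hypothetical sequences): one 3-bp gap in each row.
human = "ATGGCT---GGAGTT"
mouse = "ATG---GCTGGAGTT"
print(gap_ratio(human, mouse))
```

Gap runs longer than the threshold are ignored entirely, so a whole missing exon does not inflate the ratio.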
To date, 32 SNPs have been mapped within a 40,513-bp region containing the human ELN gene. Only one of these SNPs is located within a coding region, representing a synonymous substitution in exon 20 (REFSNP # 2071307); this gives a frequency of 0.0004 SNPs per base pair in coding regions. The silent and non-coding polymorphisms do not appear to affect the function of the protein, in contrast to missense SNPs found in SVAS patients (with an occurrence of 1/20,000 births) [28], although their role(s) in the regulation of transcription cannot be ruled out. In SVAS patients, 20 different ELN exons show mutations that affect the reading frame [28]. Even at 1 mutation per exon, which is lower than observed, the resulting frequency of 0.009 changes per base pair differs from the frequency of synonymous changes by an order of magnitude. This suggests that although the ELN gene tolerates silent polymorphisms, mutations in the locus frequently lead to events that diminish the function of the protein.

Discussion
Initial comparisons of the human and mouse genome sequences revealed three classes of genes associated with divergent coding regions: those encoding extracellular, immune system, and reproductive proteins [20]. The findings we report here, showing marked sequence divergence in the ELN gene, are consistent with the elastin protein being a major component of the extracellular matrix in vertebrates. Furthermore, the differential alternative splicing seen with the gene might be used to tailor the structural function of the protein in different tissues. Several alternatively spliced variants are associated with disease phenotypes such as SVAS [28]. On the other hand, one of the alternatively spliced variants in human possesses the highly hydrophilic exon 26A, which may play an important role through its interactions with other matrix macromolecules [4] and therefore is selectively maintained in the genome.
Analyses of the human ELN gene reveal a dynamically evolving region of the genome. For instance, mutant phenotypes associated with various SNPs in SVAS patients suggest that the gene is susceptible to mutation. Additionally, the excision of two exons in the 3' end of the primate gene indicates that the region has undergone recent recombination. Other mechanisms that provide diversity include species-specific alternative splicing and lineage-specific exons. Thus, there are multiple factors driving ELN diversity (e.g., SNPs, recombination between repetitive elements, expansion and contraction of in/del units). These findings are congruent with conclusions drawn in Watanabe et al. (2002) [29], where it was shown that early/late transitions in replication timing correlate with high SNP frequency, increased DNA damage, transitions in GC content, and concentrated occurrence of disease-related genes.
Multi-species nucleotide-level alignments alone do not provide sufficient information to annotate the coding sequence of the ELN genes, primarily due to splice junctions that align out of register among species. This pattern results from the expansion and contraction of in/dels, which makes the alignment output more complex to interpret. Furthermore, amino acid alignments show that the non-conserved amino acids are limited mainly to hydrophobic or neutral sites. Many of these hydrophobic residues are encoded by 4-fold degenerate codons that can expand to "12-fold" degenerate sites. That is, amino acids V, P, A, and G are encoded by 4-fold degenerate codons, which in the case of V, A, and G start with the nucleotide G, followed by T, C, or G in the second codon position. Finally, 4 nucleotide choices are available in the third codon position. This suggests that nearly any nucleotide substitution in these sites is acceptable for the function of the elastin protein, as long as it encodes one of these hydrophobic amino acids. Therefore, our results indicate that nucleotide divergence is generally tolerated, but is localized to specific domains and/or sites (i.e., mainly hydrophobic regions). A potential mechanism for incorporating runs of similar codons is slippage during replication [30]. However, diseases attributed to trinucleotide repeat expansion have runs of a single codon (polyglutamine; reviewed in [31]) and not a varying repeat unit, as seen with ELN.
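The codon arithmetic behind the "12-fold" description can be checked against the standard genetic code. The Python sketch below is only an illustration; the table is built from the standard code in TCAG order, and the variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons for the four small hydrophobic residues discussed in the text.
codons_by_residue = {
    residue: sorted(c for c, aa in CODON_TABLE.items() if aa == residue)
    for residue in "VAGP"
}

# Codons of the form G-[T/C/G]-N: 3 second-position choices x 4 third-position choices.
g_start = sorted(
    c for c in CODON_TABLE if c[0] == "G" and c[1] in "TCG"
)

for residue, codons in codons_by_residue.items():
    print(residue, codons)
print(len(g_start), sorted({CODON_TABLE[c] for c in g_start}))
```

All twelve G-initial codons with T, C, or G at the second position encode valine, alanine, or glycine, whereas the four proline codons begin with CC.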
It appears unlikely that the amino acid insertions occurred in an ancient lineage, since the number of duplicated in/dels varies drastically among the modern sequences. The identity of the variable repeat unit resembles the recognition site of the elastin-binding protein, PGAIPG [32], located in human exon 16 [15]. This repeat is in a conserved position in cow, but not other species. Variations in the hydrophobic repeat sequence could create new recognition sites for this or other interacting molecules, providing opportunities for modulating the level and type of interaction. For example, some ligands for the bovine elastin receptor are chemotactic for neutrophils and fibroblasts [33,34]. If this were the case for human elastin, then one would expect similar variability in other species. For instance, VGVAGP is also recognized as a ligand in human, cow [32], and baboon [19], but is not conserved in other species. Other identified ligands for the elastin receptor in bovine (VGAMPG, VGMAPG, VGSLPG, VGLSPG, and GIAPG) are also chemotactic for neutrophils and fibroblasts [33,34], although they are not represented in the elastin protein. Since the ligand-binding sites are not conserved, it is likely that each species has its own recognition signal and the diversity of the repeats expands the repertoire of interacting molecules.
Beyond the coding region, evolutionary changes of the ELN gene are also occurring through insertion and recombination of repetitive elements. In general, the ELN gene in mammalian species contains a somewhat higher proportion of repeat elements than the rest of the genome, with numerous repeats concentrated ~70 kb upstream of the first coding exon (in a region of extremely low GC content). These repeat elements are related in the rodent lineage, but do not align to other species for which sequence is available (e.g., cat and human), implying that they occurred after the human-rodent speciation event. The cat sequence shows an abundance of LINE1 elements in its region of low GC content (data not shown), whereas rodent sequences show a similar distribution of SINEs, LINEs, and simple sequences. Coupled with evidence of recombination between Alu repeats in primate species [19], these observations suggest that the region may contain a hot spot for insertion of repetitive elements.
Our analyses indicate that although ELN appears to be accumulating mutations (both nucleotide substitutions and in/dels) in all mammalian lineages examined, the locations and identity of these changes vary among species, perhaps with different functional consequences. The fact that the majority of amino acid substitutions and in/dels observed among eight mammalian species involve hydrophobic amino acids may reflect selective pressure to preserve the hydrophobic nature of the elastin protein. Furthermore, most amino acid substitutions do not change the chemical properties of the particular site, but rather interchange similar amino acids (e.g., exchange of glycine with valine or proline). Most of these evolutionary changes are concentrated in the hydrophobic regions of the protein, which represent a more flexible part of the protein (despite the overall hydrophobicity, these regions remain accessible on the protein surface [35]). Our results suggest that changes in the hydrophobic regions do not lead to major alterations to the structure and function of the elastin protein, as long as the overall hydrophobic properties are maintained. In contrast, crosslinking domains allow fewer substitutions and in/del events than hydrophobic domains. This can be explained by purifying selection that not only maintains the primary structure of the protein, but also affects the spatial arrangement of crosslinking regions (i.e., the main components responsible for elasticity). In short, the combination of purifying selection acting to preserve the hydrophobic nature of the protein and in-frame in/del events has shaped the evolution of the ELN gene in mammals. Similar multispecies comparisons are needed to establish whether this pattern is specific to ELN or is more generally the case for genes encoding structural proteins.

Conclusions
Vertebrate ELN is a divergent gene that plays an important role in human health. Key structural and functional elements of the gene, including sequences of splice junctions and alternation of hydrophobic and crosslinking domains, are retained through purifying selection and remain virtually unchanged across large phylogenetic distances. However, the remaining parts of the gene (such as repeated PGVA or AAAAYKAA amino acid motifs, and others) are more flexible and subject to in/del events that are selected against in the majority of other protein-coding genes in mammalian genomes. The hydrophobic repeat elements that comprise the insertion elements are likely to play a role in diversifying the repertoire of the interaction domain of the elastin protein in each species.

Methods
A list of sequences used in this study is provided in Table  1. The multi-species genomic sequences were generated as part of the NISC Comparative Sequencing Program ( [36]; http://www.nisc.nih.gov; 2003 freeze). The human ELN gene structure was taken from [16]. Alternative splicing in human ELN is documented in GenBank reference XP_004897.2 and PIR entry EAHU.
Alignment of genomic sequences was performed using the PipMaker and MultiPipMaker servers running the blastz alignment program [21]. The comparisons were computed as local alignments using the default parameters of the program [21]. Pair-wise cDNA and protein comparisons used the global alignment program ALIGN, available at EBI, to calculate percent identity and similarity. The ratio of gaps per coding region was computed using whole-genome human-mouse alignments generated with the blastz program [21].
The exon numbers for the ELN gene correspond to those of the human cDNA sequence in [16]. Exons that are absent in humans but present in other species are numbered and lettered consecutively (i.e., the last exon in bovine ELN is 36 and corresponds to human exon 34; see supplementary Table A). An additional exon is shared between the mouse mRNA sequence (NM_007925, [37]) and the rat mRNA (M60647, [38]) (see Figure 1). Calculations of mouse-rat similarity in coding sequences used data from reciprocal best matches available from the Homologene database.
Because of the low exon/intron ratio, automated gene prediction fails to find many exons in the ELN gene. In this study, exons were identified by homology at the nucleotide and amino acid levels, and around the putative splice sites [16]. Translated amino acid sequences were aligned using ClustalX [39], and then manually adjusted to preserve the exon correspondence among species. Nucleotide sequences from the corresponding coding regions were then aligned according to the amino acid alignment using the program BioEdit (T. Hall). The shading of the amino acid alignment that reflects chemical properties of amino acids (Figure 1) was performed using the program GeneDoc (Nicholas K.B. and Nicholas H.B. Jr.).
The number of synonymous substitutions per synonymous site (Ks) and the number of nonsynonymous substitutions per nonsynonymous site (Ka) were estimated using the Li-Wu-Luo method [25], as implemented in the program MEGA2 [40].
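The effect of the complete-deletion option, under which only codons present in every species enter the Ks and Ka estimates, can be sketched as follows. This is an illustration of the filtering step only, not the MEGA2 implementation or the Li-Wu-Luo estimator; the function name and the toy sequences are hypothetical, and the sketch assumes a codon-aligned nucleotide alignment in which gaps are written as "-".

```python
def complete_deletion(codon_alignment):
    """Keep only codon columns that are gap-free in every sequence.

    `codon_alignment` maps a species name to an aligned coding sequence.
    All sequences must have the same length, and that length must be a
    multiple of 3 so that columns can be grouped into codons.
    """
    lengths = {len(seq) for seq in codon_alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")

    # Start positions of codons with no gap character in any species.
    kept = [
        start
        for start in range(0, length, 3)
        if all("-" not in seq[start:start + 3] for seq in codon_alignment.values())
    ]
    return {
        name: "".join(seq[start:start + 3] for start in kept)
        for name, seq in codon_alignment.items()
    }


# Toy codon alignment (hypothetical sequences, not ELN data).
toy = {
    "human": "GGAGTT---CCA",
    "mouse": "GGCGTTGCACCA",
    "cow":   "GGA---GCACCT",
}
print(complete_deletion(toy))
```

Because any codon missing in a single species is dropped for all species, lineage-specific in/dels do not contribute to the substitution counts.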