The human genome is the result of 10^9 years of evolution. It is very complex, and in some respects it is still evolving. This contribution concerns the evolution of the so-called Alu elements, which are mobile sequences of DNA that are very abundant in the human genome. We present a mathematical random process, the Griffiths-Pakes discrete-time branching process with infinite-allele mutations, which is almost ideally suited for modeling the proliferation of Alu elements. For the biologically important special case of the linear-fractional offspring distribution, we derive semi-explicit expressions for the expected frequency spectra of classes of alleles existing in a given number of copies (an analogue of the Ewens sampling formula). We compare the outcome with Alu-element statistics data.
Alu repeat sequences
Background on Alus
Alu elements belong to the group of transposable or mobile elements, which occupy nearly 45% of the human genome [1]. Within this group of transposable and highly repetitive elements, LINEs (Long INterspersed Elements) and SINEs (Short INterspersed Elements) form the two largest groups. They occupy 21% and 13% of the human genome, respectively [2]. Whereas the LINEs are dominated by L1 elements, the largest and hence most studied group of the SINEs consists of the Alu elements. While many transposable elements are present in all eukaryotic genomes, Alu elements appear only in mammals. A typical full-length Alu sequence is approximately 300 bp long. Alu sequences amplify by retrotransposition, also known as the "copy and paste" mechanism. At present it is estimated that more than one million copies of Alu elements occupy about eleven percent of the human genome, and the number of elements seems to be growing [1].
Alu elements are non-autonomous and appear to rely on the retrotransposition machinery of L1 elements. It has been hypothesized that L1 endonuclease causes a nick at the TTAAAA consensus site, after which Alu anneals directly to the site of integration [3]; then a second nick on the other strand completes the insertion. These two staggered nicks introduce an identifiable characteristic of Alu elements: the newly inserted Alu element is flanked by an identical pair of direct repeats, also called target site duplications (TSDs). These direct repeats range from 10 to 15 bp and are considered the prevalent feature of retrotranspositional insertion [4]. This process of integration, also known as target-primed reverse transcription (TPRT) [5, 6], is responsible for the successful amplification of Alu elements.
Based on diagnostic mutations, Alu elements are divided into subfamilies. The three major families of Alu sequences are J, S and Y. The letters are chosen in alphabetical order to convey the different ages of each family. Alu sequences in the J family are the oldest, while Alu sequences in the Y family are the youngest. The most interesting family in the current research of Alu elements is the Y family, which contains the youngest and most active Alu elements [7]. Due to their recent integration, 25 percent of their loci are still polymorphic [1]. An Alu locus is defined to be polymorphic if some individuals have an Alu element at that particular location while others do not. These polymorphic loci can be used as genetic markers for disease association studies.
Unlike single-nucleotide polymorphisms (SNPs), Alu markers are few in number, but they are identical by descent and essentially homoplasy-free, and their ancestral state, defined as the absence of the element from a specific locus, is always known. Polymorphic Alu loci have been used in genetic diversity studies, forensic studies and disease association studies [8, 9]. Alu insertions have influenced the architecture of the human genome by duplication, deletion, inversion, transduction and translocation [10]. Alu elements frequently appear in introns, 3' untranslated regions of genes, and intergenic genomic regions [11]. Alu insertions act as insertional mutagens and are responsible for 0.5 percent of human genetic disorders [12]. Almost all of these diseases are caused by Alu elements from the youngest subfamilies [6]. A comprehensive list of AluY disease loci and their associated diseases is given in [6]; that summary of diseases induced by Alu insertions includes neurofibromatosis, hemophilia A and B, Huntington disease and Apert syndrome. Deininger and Batzer [12] attribute diseases such as insulin-resistant diabetes type II, Lesch-Nyhan syndrome, Tay-Sachs disease, familial hypercholesterolaemia and thalassaemia to Alu-mediated recombination. Additionally, several types of cancer, including Ewing sarcoma, breast cancer and leukemia, have been shown to be caused by Alu elements [1, 12].
Alu sequence data used in this study
Dr. Jerzy Jurka of the Genetic Information Research Institute (GIRI) kindly provided Alu sequence data for our analysis. All Alu subfamilies were extracted from the March 2006 assembly of the UCSC Human Genome database. Only recognizable full-length Alu sequences were retained for analysis. Overall, Alu sequences for nine different Alu subfamilies were extracted from the UCSC reference genome: AluYa1, AluYa4, AluYa5, AluYa8, AluYb8, AluYc1, AluYd2, AluYe2, and AluYe5.
The goal was to extract Alu sequences that belonged to relatively large subfamilies (more than 1000 sequences), such as AluYa1, AluYa4, AluYb8, AluYc1, and AluYe2. For each subfamily, a consensus or reference Alu sequence was used to screen the entire human genome for matching sequences. A match occurred when stretches of nucleotides that include the main diagnostic mutations agreed with the Alu subfamily consensus sequence. Since the insertion mechanism of an Alu element introduces large differences in the poly-A tails, these tails must be excluded from the analysis. Dr. Jurka provided the Alu sequence data with poly-A tails already deleted.
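The matching rule described above can be illustrated with a minimal Python sketch. It is not the screening procedure actually used to prepare the data: the function name `matches_subfamily`, the representation of diagnostic stretches as `(start, motif)` pairs, and the toy coordinates are all assumptions made for illustration, and the sequence is assumed to be already expressed in consensus coordinates without gaps.

```python
def matches_subfamily(sequence, diagnostic_stretches):
    """Return True if `sequence` agrees with the consensus on every diagnostic stretch.

    `diagnostic_stretches` is a list of (start, motif) pairs (hypothetical format):
    `start` is a 0-based consensus coordinate and `motif` is the consensus
    nucleotide stretch that contains a diagnostic mutation.
    """
    for start, motif in diagnostic_stretches:
        # Compare the window of the candidate sequence with the consensus motif.
        if sequence[start:start + len(motif)] != motif:
            return False
    return True


# Toy stretches, not real Alu diagnostic positions.
toy_stretches = [(2, "CAT"), (8, "GG")]
is_match = matches_subfamily("TTCATAAAGGTT", toy_stretches)
is_mismatch = matches_subfamily("TTCGTAAAGGTT", toy_stretches)
```

Here the first toy sequence agrees on both stretches, whereas the second differs at one diagnostic position and is rejected.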
Alu sequences contain the middle A-stretch, another highly variable region similar to the poly-A tail, which lies between the two monomers that constitute an Alu sequence and can be considered the A-tail of the first monomer. To delete the middle A-stretch accurately, it is necessary to align the Alu sequences for each subfamily. A consensus sequence for each subfamily was obtained from Repbase [13], a database of repetitive elements maintained by GIRI. In each subfamily, pairwise alignment of each Alu sequence with the Repbase consensus sequence was performed using ClustalW [14]. MEGA4 software [15] was used to display the alignments, including the middle A-stretch. After deletion of the middle A-stretch, the average length of an Alu sequence is about 260 base pairs.
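To make the role of the alignment concrete, the following Python sketch removes a consensus-defined region from a sequence that has been pairwise aligned to the consensus. It is only an illustration under stated assumptions: the function name `remove_consensus_region`, the half-open coordinate convention `[start, end)`, and the toy sequences are hypothetical, and the actual deletion in this study was carried out on ClustalW alignments displayed in MEGA4.

```python
def remove_consensus_region(aligned_consensus, aligned_sequence, start, end):
    """Delete from `aligned_sequence` the columns that align to consensus
    positions start..end-1 (0-based, counted on the ungapped consensus).

    Both inputs are gapped strings of equal length, with '-' marking a gap.
    Returns the remaining sequence with gap characters removed.
    """
    if len(aligned_consensus) != len(aligned_sequence):
        raise ValueError("aligned strings must have equal length")
    kept = []
    pos = 0  # number of consensus bases consumed so far
    for cons_char, seq_char in zip(aligned_consensus, aligned_sequence):
        if cons_char != "-":
            in_region = start <= pos < end
            pos += 1
        else:
            # Insertion relative to the consensus: drop it only if it lies
            # strictly inside the region; boundary insertions are kept.
            in_region = start < pos < end
        if not in_region and seq_char != "-":
            kept.append(seq_char)
    return "".join(kept)


# Toy alignment: consensus GGCAAAATTC, with the A-run at positions 3..6.
trimmed = remove_consensus_region("GGCAA--AATTC", "GGTAAAAA-TTC", 3, 7)
```

Working in consensus coordinates is what allows an A-stretch of variable length, including inserted bases inside it, to be removed consistently across all sequences of a subfamily.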
Following the preparatory steps described above, we obtained the counts of Alu sequences that had n identical copies in the sample, for n = 1, 2, 3, .... To obtain these counts for each Alu subfamily, a program was written in R. These counts, or the corresponding percentages, represent the final data, which were tested against the theoretical distribution based on the branching process model.
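The counting step can be sketched as follows. The original program was written in R and is not reproduced here; this is a minimal Python equivalent under the assumption that two sequences count as copies of the same variant only when their trimmed strings are exactly identical. The names `frequency_spectrum` and `spectrum_percentages` are hypothetical, and expressing percentages relative to the number of distinct variants is likewise an assumption.

```python
from collections import Counter


def frequency_spectrum(sequences):
    """Return {n: number of distinct sequence variants present in exactly n copies}."""
    copies_per_variant = Counter(sequences)          # variant -> number of identical copies
    spectrum = Counter(copies_per_variant.values())  # n -> number of variants with n copies
    return dict(sorted(spectrum.items()))


def spectrum_percentages(spectrum):
    """Express the spectrum as percentages of the total number of distinct variants."""
    total = sum(spectrum.values())
    return {n: 100.0 * count / total for n, count in spectrum.items()}


# Toy sample: one variant in 3 copies, one in 2 copies, two singletons.
toy_sample = ["ACGT", "ACGT", "ACGA", "TCGA", "TCGA", "TCGA", "GGGT"]
toy_spectrum = frequency_spectrum(toy_sample)
toy_percent = spectrum_percentages(toy_spectrum)
```

For the toy sample the spectrum is two variants with one copy, one variant with two copies and one variant with three copies.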