Short soybean sequences compared with A. thaliana
The importance, variety and number of short soybean sequences, together with the availability of well-described A. thaliana genomic sequence, were the primary factors in selecting this test. On average, the program compared over 160 Tbp/hr (3,000 soybean cDNA records with an average of 461 bp each × 119 Mbp of A. thaliana DNA = 164.5 Tbp/hr). A total of 33,106 short soybean sequence matches with A. thaliana were reported from 341,619 unique records (9.7%). These matches comprised 30,014 soybean ESTs (about 3,000 unique genes), 2,946 BES, 135 microsatellite sequences, 6 FiS sequences, and 5 GmClone associations. This association of genes and sequences between soybean and A. thaliana can serve several functions, including identification of gene clusters within soybean, identification of potentially syntenic chromosomal regions, estimation of soybean expression, and detection of single nucleotide polymorphisms within soybean sequences.
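The throughput and match-rate figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are illustrative, and reading "3,000 records" as a per-hour rate is an assumption drawn from the Tbp/hr unit.

```python
# Reproduce the throughput estimate from the figures quoted in the text.
records_per_hour = 3_000      # soybean cDNA records (assumed to be a per-hour rate)
mean_record_bp = 461          # average record length in bp
target_bp = 119_000_000       # A. thaliana DNA compared against (119 Mbp)

query_bp_per_hour = records_per_hour * mean_record_bp
tbp_per_hour = query_bp_per_hour * target_bp / 1e12

# Share of unique soybean records with a reported match.
matches_by_type = {
    "EST": 30_014,
    "BES": 2_946,
    "microsatellite": 135,
    "FiS": 6,
    "GmClone": 5,
}
total_matches = sum(matches_by_type.values())
match_fraction = total_matches / 341_619

print(f"{tbp_per_hour:.1f} Tbp/hr")
print(f"{total_matches} matches ({match_fraction:.1%} of unique records)")
```

With the rounded inputs the product comes to roughly 164.6 Tbp/hr, in line with the quoted value, and the five match categories sum to the reported total of 33,106.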
Theoretically, a random 15 bp match should occur about once every 1 Gbp (4^15 ≈ 1.07 × 10^9). Previous work (unpublished) and the fact that the DNA sequence tested was largely non-random (i.e. ESTs) led to the requirement for multiple, local 7 bp matches. Each 7 bp (p = 1e-05) match was found within a 400 bp region, and empirical analysis was used to determine that six of these additional matches were a practical minimum. To test the accuracy of the soybean vs A. thaliana matches, e-scores for 160 randomly distributed matches from each A. thaliana chromosome were generated using NCBI's BLAST 2 Sequences (bl2seq) utility. As Figure 2 illustrates, increasing the number of additional 7 bp hits does increase the average e-score when comparing sequences from these two genomes. The number of additional 7 bp hits can be applied as a filter during post-processing, allowing the user to investigate the parameters required for acceptable similarity while running the program only once.
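The two-stage criterion described above (an exact 15 bp match followed by a count of additional local 7 bp matches) can be sketched as follows. This is a minimal illustration and not the program's actual implementation: the function names are hypothetical, matching is assumed to be exact, and the 400 bp region is assumed to be centred on the start of the 15 bp match.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_seed(query, target, seed_len=15, advance=1):
    """Return (query_pos, target_pos) of the first exact seed match, or None."""
    for q_pos in range(0, len(query) - seed_len + 1, advance):
        t_pos = target.find(query[q_pos:q_pos + seed_len])
        if t_pos != -1:
            return q_pos, t_pos
    return None


def count_additional_hits(query, target, q_pos, t_pos,
                          seed_len=15, k=7, window=400):
    """Count distinct query k-mers outside the seed that recur near the seed."""
    half = window // 2  # assumption: region is centred on the seed start
    lo = max(0, t_pos - half)
    hi = min(len(target), t_pos + half)
    # Exclude the seed on both sequences so it is not counted a second time.
    target_side = kmers(target[lo:t_pos], k) | kmers(target[t_pos + seed_len:hi], k)
    query_side = kmers(query[:q_pos], k) | kmers(query[q_pos + seed_len:], k)
    return len(query_side & target_side)


def report_match(query, target, min_additional=6):
    """Return (target_pos, additional_hits) if the match passes, else None."""
    seed = find_seed(query, target)
    if seed is None:
        return None
    q_pos, t_pos = seed
    hits = count_additional_hits(query, target, q_pos, t_pos)
    return (t_pos, hits) if hits >= min_additional else None
```

Because the number of additional hits is returned along with the position, a stricter threshold can be applied to stored results afterwards without rescanning the target, which mirrors the post-processing use described in the text.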
The five chromosomes of A. thaliana were put into MapChart [24] format for presentation. Reported matches to each chromosome are presented in color, linear-order maps (supplemental; B&W Figure 3). Genic regions reported for A. thaliana (30,694) are identified by dark (blue) segments and light (blue) descriptive text. Soybean EST and BES sequences are both in black text, with soybean markers presented in red.
Comparison with previously reported soybean-A. thaliana synteny
As previously noted, marker synteny has been reported between A. thaliana chromosome I and soybean linkage group A2 (including homeologous regions of A1, E and C2), and between chromosome IV and A2 (including II/IV and J/L homeology) [17]. The inclusion of RFLP and simple sequence repeat (SSR) amplicon data from soybean allows limited comparison with this previous work. Although sequences from SSR amplicons did match A. thaliana DNA sequence using this procedure, SSR sequences by definition contain repeats and usually yielded non-significant results when tested with the bl2seq utility [17]. When SSR-containing BACs were analyzed (denoted in the map as a BAC plate location followed by SSR and soybean chromosome location), significance was similar to that of SSR amplicons, i.e. limited. Although both of these data types are included in the map, they were not as reliable as EST-based matches. It must also be noted that this research infers only gene content, not gene order, between A. thaliana and soybean.
The effects of sequence size and comparison power were illustrated by the comparison of the Satt238 amplicon (435 bp) with A. thaliana chromosome I (30,432,563 bp). The program reported a 23 bp match at base pair 961,168. A comparison of the entire A. thaliana chromosome I (NC_003070) with this sequence (gi14969942) yielded no significant similarity. However, using a 2000 bp window of AtI DNA (960,000–962,000 bp) yielded an expect value of 1.0. Decreasing the A. thaliana DNA window to 961,150–961,190 bp yielded a significance level of 0.009. Finally, reducing the sequence to 961,163–961,185 bp yielded a result of 0.001. Although previous analysis [17] was performed using tBLASTx, this type of analysis yielded no significant results for any of the above marker sequence comparisons.
Comparing the soybean gi sequences displayed in Figure 3b with the reported A. thaliana chromosome V region (15,426,000–15,429,000) yields four linear-order groups. The first group, gi15200550–6913885, was similar to A. thaliana chromosome V (range e-67 – e-87); e-values between the soybean sequences were 0.0. The second group, gi10237097–19054121, was similar to chromosome V (range e-60 – e-83); e-values between the soybean sequences were e-112 – 0.0. The third group, gi16347707–4306793, was similar to chromosome V (range e-06 – e-15); e-values between the soybean sequences were e-163 – 0.0. The fourth, a single sequence (gi5676855), was similar to chromosome V at e-20. This technique thus identified and clustered similar soybean sequences based on their position on A. thaliana DNA. Soybean sequences that grouped together frequently had bl2seq scores of 0.0, with values rarely above e-100. The ability to detect SNPs within these grouped soybean sequences is illustrated in Figure 3c.
Critical program parameters
The ability to start each check at an initial 5 bp identity increases processing speed by nearly 1000×. A short-sequence advance value determines whether each 15 bp fragment is advanced by one, two, or more bases for each screen of the model (A. thaliana) DNA sequence; larger values can increase speed by reducing the number of sequence screens.
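A minimal sketch of these two parameters is given below. It is one plausible way to realize a 5 bp initial check, not the program's documented design: the prefix index, the function names, and the default values are all assumptions made for illustration. The intuition is that, for random sequence, only about 1 in 4^5 = 1,024 target positions shares a given 5 bp prefix, so the full 15 bp comparison is attempted at only a small fraction of positions.

```python
from collections import defaultdict


def index_prefixes(target, prefix_len=5):
    """Map every prefix_len-mer in the target to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(target) - prefix_len + 1):
        index[target[pos:pos + prefix_len]].append(pos)
    return index


def scan(query, target, index, frag_len=15, prefix_len=5, advance=1):
    """Return (query_pos, target_pos) pairs for exact frag_len matches.

    Only target positions sharing the fragment's first prefix_len bases are
    compared in full; `advance` is the step between successive fragments.
    """
    hits = []
    for q_pos in range(0, len(query) - frag_len + 1, advance):
        fragment = query[q_pos:q_pos + frag_len]
        for t_pos in index.get(fragment[:prefix_len], ()):
            if target.startswith(fragment, t_pos):
                hits.append((q_pos, t_pos))
    return hits


# Toy example: a 23 bp query embedded in an otherwise uniform target.
query = "ACGGCATTAGCCGATACGGATCC"
target = "T" * 50 + query + "T" * 50
index = index_prefixes(target)
every_base = scan(query, target, index, advance=1)
every_third = scan(query, target, index, advance=3)
print(len(every_base), len(every_third))
```

Raising `advance` reduces the number of fragments screened roughly in proportion, at the cost of possibly skipping a fragment that would have matched when the shared region is short.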
Although all soybean sequences were placed sequentially in relation to the A. thaliana genome, a limitation on map precision means that sequences falling within a 400 bp range are not separated into uniquely delimited map entries. For example, Figure 3b presents a single callout containing gi numbers 15200550 through 4306793. A gap of 124 bp exists between gi6913885 (15,427,045 bp) and gi10237097 (15,427,169 bp), and a gap of 244 bp exists between gi19054121 (15,427,172 bp) and gi16347707 (15,427,416 bp). These gaps may represent weaker than typical links between grouped soybean records (although still significant matches at 4e-158 and 4e-18, respectively).
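The grouping behaviour can be illustrated with the four positions quoted above. The sketch assumes that entries are merged into one callout whenever neighbouring positions lie less than 400 bp apart; the text does not state the exact rule the mapping step uses, and `group_by_gap` is a hypothetical helper.

```python
def group_by_gap(entries, max_gap=400):
    """Group (label, position) pairs whose sorted neighbours are < max_gap apart."""
    groups = []
    for label, pos in sorted(entries, key=lambda entry: entry[1]):
        if groups and pos - groups[-1][-1][1] < max_gap:
            groups[-1].append((label, pos))   # close enough: same callout
        else:
            groups.append([(label, pos)])     # start a new callout
    return groups


# Positions on A. thaliana chromosome V as quoted in the text.
entries = [
    ("gi6913885", 15_427_045),
    ("gi10237097", 15_427_169),
    ("gi19054121", 15_427_172),
    ("gi16347707", 15_427_416),
]

groups = group_by_gap(entries)
gaps = [b[1] - a[1] for a, b in zip(entries, entries[1:])]
finer_groups = group_by_gap(entries, max_gap=200)
print(len(groups), gaps, len(finer_groups))
```

Under this assumption all four records fall into a single callout at 400 bp, while a hypothetical 200 bp limit would split the callout at the 244 bp gap.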