Identification of conserved and polymorphic STRs for personal genomes

Abstract

Background

Short tandem repeats (STRs) are abundant in human genomes. Numerous STRs have been shown to be associated with genetic diseases and gene regulatory functions, and have been selected as genetic markers for evolutionary and forensic analyses. High-throughput next generation sequencers have fostered new cutting-edge computing techniques for genome-scale analyses, and cross-genome comparisons have facilitated the efficient identification of polymorphic STR markers for various applications.

Results

An automated and efficient system for detecting human polymorphic STRs at the genome scale is proposed in this study. Assembled contigs from next generation sequencing data were aligned and calibrated according to selected reference sequences. To verify identified polymorphic STRs, human genomes from the 1000 Genomes Project were employed for comprehensive analyses, and STR markers from the Combined DNA Index System (CODIS) and disease-related STR motifs were also applied as cases for evaluation. In addition, we analyzed STR variations for highly conserved homologous genes and human-unique genes. In total 477 polymorphic STRs were identified from 492 human-unique genes, among which 26 STRs were retrieved and clustered into three different groups for efficient comparison.

Conclusions

We have developed an online system that efficiently identifies polymorphic STRs and provides novel distinguishable STR biomarkers for different levels of specificity. Candidate polymorphic STRs within a personal genome could be easily retrieved and compared to the constructed STR profile through query keywords, gene names, or assembled contigs.

Background

Short tandem repeats (STRs), also known as short sequence repeats or microsatellites, are genome segments composed of short repeating sequences. The length of the fundamental repeat unit varies from one to six nucleotides [1]. STRs are highly abundant in many different organisms, and are distributed in both genic and intergenic regions [2]. Repeat structures expand or are deleted mainly due to replication slippage, which leads to highly polymorphic STR patterns among individuals [3]; these polymorphic features make STR motifs suitable genetic markers [4]. Several STR markers have been applied to individual/paternity identification and species/subspecies differentiation [5, 6], while some STRs are involved in gene regulatory pathways. Abnormal expansion of such functional STRs located within coding regions frequently causes various types of disease [7, 8]. Even when located within non-coding regions, STRs might also act as important functional regulatory elements [2, 9]. Therefore, discoveries of polymorphic STRs among different sequenced samples might be helpful for detecting useful genetic markers, while findings of well-conserved STRs might lead to their identification as functional elements for gene regulation networks.
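The definition above (a repeat unit of one to six nucleotides repeated in tandem) can be illustrated with a minimal sketch. This is not the CGSSR program used in this study; the function name find_perfect_strs, the min_copies cutoff, and the example sequence are assumptions chosen only for illustration, and the sketch reports perfect repeats only.

```python
import re

def find_perfect_strs(seq, min_copies=3, max_unit=6):
    """Return (start, unit, copies) for perfect tandem repeats.

    The repeat unit is 1..max_unit nucleotides long; min_copies is a
    hypothetical cutoff, not a parameter taken from the paper.
    """
    seq = seq.upper()
    # The lazy group captures the shortest unit; the backreference \1
    # requires at least (min_copies - 1) further adjacent copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

# Hypothetical sequence: two tetranucleotide repeats with short flanks.
example = "GGTC" + "ACAG" * 5 + "TTGC" + "AGAT" * 4 + "CC"
for start, unit, copies in find_perfect_strs(example):
    print(start, unit, copies)
```

Because the regular expression scans left to right without overlaps, a repeat is reported with the first unit phase encountered; imperfect repeats would need an alignment-based method instead.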

In traditional approaches, genomic STR markers have typically been discovered by analyzing DNA sequences through in silico methods and verified by PCR [10]. Various in silico tools are available for detecting both perfect and imperfect STRs within a single species [11, 12]. In recent years, a revolutionary development in sequencing technology called next generation sequencing (NGS) has greatly impacted the growth and speed of genetic research. With relatively low costs and increased throughput, research at the genomic and transcriptomic levels has now become affordable and practical [13]. Traditional EST libraries applied to EST-STR discovery have been gradually replaced by NGS approaches, known as RNA-seq techniques, which provide extensive coverage at the whole-transcriptome level [14]. Recent publications have shown that NGS enables low-cost and time-efficient discovery of polymorphic STR markers, even when no reference sequences are provided [15]. The latest tools have also focused on STR marker discovery through NGS read analysis. For example, QDD is an open-source STR search tool package that provides a pipeline from raw NGS reads to STR identification and corresponding primer design [16]. Hoffman and Nichols also proposed a manual method for in silico STR marker screening [17]. Their experiments with Antarctic seals demonstrated the effectiveness of in silico STR marker discovery across individual NGS samples. The lobSTR program was developed by Gymrek et al., who constructed a comprehensive survey of STR variations from NGS-derived personal genomes [18, 19]. An automated method for detecting STR polymorphisms from NGS reads could utilize the high-throughput advantages of NGS without the influence of manually examined factors. In addition, we previously developed a prototype system for detecting polymorphic STRs within human genomes based on the concept of an STR template profile [20].
However, to the best of our knowledge, there are no online web applications that allow users to compare personal genomes or specify genes for a comprehensive STR analysis. Therefore, we sought to develop an efficient identification system that is capable of detecting conserved and polymorphic STRs across different individual sequence reads. The proposed method can detect STR polymorphisms without manual curation, can be applied directly to the efficient identification of conserved and polymorphic STR markers, and can accelerate functional analysis of regulatory STR motifs.

Results and discussion

We have performed a statistical analysis of the STR distributions in several datasets including chromosomal genes, combined DNA index system (CODIS) genes, disease-related genes, cross-species homologous genes, and genes that are unique to humans. The two major reasons for performing statistical analyses on different gene sets were: (1) to determine the most frequently appearing lengths of polymorphic/conserved STR patterns, and (2) to determine the regions in which polymorphic STR motifs occur most frequently. To understand the extent of variation in the identified STRs, the distribution scale ranged from 1 to 84 bp. In addition, we selected the interval from 20 to 84 bp to analyze the degree of conservation of identified STRs. Retrieved STRs from the different datasets are shown and discussed in the following sub-sections.

CODIS marker analysis

CODIS is a collection of investigated and verified DNA markers provided by the U.S. Federal Bureau of Investigation (FBI) to criminal justice services. Thirteen STR markers (within ten defined genes and three intergenic segments) were examined in this program. From the verification results, our proposed system could successfully detect and list all 13 STR markers from 7 genomes, including six individual genomes from the 1000 Genomes Trio Project and one from the Ensembl reference genome. All retrieved STR markers are listed in Table 1. It should be noted that both STR markers within the gene vWA (ENSG00000110799) defined by CODIS contain multiple short repeat patterns, and that the adopted CGSSR program could successfully identify the three STR markers "ACAG", "AGAT", and "TCCA" for vWA.

Table 1 Polymorphisms of CODIS STR markers for 7 different individuals.

It was also observed that most polymorphic STRs within a family group agreed with inherited characteristics, i.e., the daughter's alleles were inherited from either one of her parents. Based on CODIS STR markers for comparing these two families from the Trio Project, the results show that 7 of 13 STR loci displayed identical repeat patterns/numbers among all selected individuals, and only one or two STR patterns with minor differences in length could be found between parents and daughter in both families. These results strongly suggest that distinguishability at the individual level in the post-NGS era would likely improve if more distinct STR markers were added to support CODIS.
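The inheritance check described above can be sketched in a few lines. The sketch assumes one consensus repeat length per individual and locus (so heterozygous alleles are collapsed), and the function name, locus labels, and lengths are hypothetical rather than values from Table 1.

```python
def inheritance_violations(trio_lengths):
    """Return loci where the child's repeat length matches neither parent.

    trio_lengths maps a locus name to (child, father, mother) repeat
    lengths in bp, one consensus value per individual.
    """
    violations = []
    for locus, (child, father, mother) in trio_lengths.items():
        if child != father and child != mother:
            violations.append(locus)
    return violations

# Hypothetical repeat lengths in bp: (child, father, mother)
example = {
    "locus_1": (24, 24, 28),  # inherited from the father
    "locus_2": (16, 16, 16),  # identical in all three
    "locus_3": (44, 40, 36),  # matches neither parent
}
print(inheritance_violations(example))  # ['locus_3']
```

A locus flagged this way is not necessarily an error; with a single consensus sequence per person, mixed reads from heterozygous alleles can produce the same pattern.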

Disease gene analysis

To verify the accuracy and efficiency of the proposed system in detecting STR markers, we selected 13 well-known genes containing disease-related STRs. All identified STRs occurred in different genetic locations including protein coding regions, 5′ UTRs, 3′ UTRs and introns; large variations in repeat number might be causally related to serious genetic diseases according to previous medical reports. Table 2 lists all details including gene names, STR patterns and their genetic locations, expansion/deletion mechanisms, disease names, and references [8, 9, 21–30].

Table 2 A look-up table of genetic diseases, gene names, and corresponding STR patterns.

The polymorphisms of disease-related STRs within all individuals were detected and compared, as shown in Table 3. The results show that 10 of 13 polymorphic STRs among all 7 individuals could be identified, and most repeat numbers fall within the normal range. However, three STR markers could not be retrieved from two individual samples (shown as 0* in the two individual IDs NA19238 and NA12878). The unsuccessful STR detection was mainly due to missing nucleotides in the consensus sequences. Figure 1 displays the undetected STR patterns by showing alignment results of the target STRs and corresponding flanking sequences between the reference profile and individual sequences. We observed that the individual consensus sequences were filled with the character "N" at the expected repeat locations; this might be due to NGS sequencing flaws or errors caused by the sequence alignment map (SAM) tool consensus output data. These examples of unsuccessful detection also indicate that the performance of the proposed system depends on the accuracy of NGS sequencing and reconstruction processes. In Table 3, none of the remaining successfully retrieved STRs showed abnormal patterns consistent with lethal diseases. Most of these regulatory STRs were identified in all individuals and were matched with family inheritance characteristics. Nevertheless, from the resulting tables we observed that two STR patterns located within the coding regions of the DMPK and AR genes were not consistent with heredity principles. This phenomenon might be a result of mixed sequencing data from heterozygous alleles. More recently developed assembly and reference mapping methods might be capable of distinguishing heterozygous alleles and overcoming such problems [31].
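Since undetected STRs coincided with runs of "N" in the consensus sequences, a simple pre-check for masked regions may be useful before STR calling. The following is a minimal sketch; the function names, the min_len cutoff, and the toy sequence are assumptions and are not part of the proposed system.

```python
import re

def n_runs(consensus, min_len=5):
    """Return (start, end) half-open intervals of runs of at least min_len 'N'."""
    pattern = re.compile(r"N{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(consensus.upper())]

def masked_fraction(consensus, start, end):
    """Fraction of positions in consensus[start:end] that are 'N'."""
    region = consensus[start:end].upper()
    if not region:
        return 0.0
    return region.count("N") / len(region)

# Hypothetical consensus: flanks are present, the repeat itself is masked.
consensus = "ACGT" + "N" * 12 + "TTGA"
print(n_runs(consensus))                  # [(4, 16)]
print(masked_fraction(consensus, 4, 16))  # 1.0
```

A locus whose expected repeat interval is largely masked can then be reported as "not assessable" rather than as a repeat count of zero.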

Table 3 Polymorphism of disease-related STR markers.
Figure 1

Examples of undetected STRs for well-known genetic diseases. STRs within three genes (HTT, FMR1, and PABPN1) could not be identified for two individuals (NA19238 and NA12878). The three bounding boxes represent the aligned results for the three different genes. STR detection failed in both the HTT and PABPN1 genes only for the NA19238 genome, whereas it failed in FMR1 for both the NA12878 and NA19238 genomes. Flanking sequences of these STRs were well-aligned and are indicated using "*" symbols. It can be observed that the target STR was not detected due to the absence of consensus STR segments (shown with the character "N"). Missing nucleotides might be caused by NGS sequencing issues or by errors introduced when applying the SAM tool.

Polymorphic STR distributions for homologous and human-unique genes

To discover and distinguish important features of polymorphic STRs extracted from orthologous genes and human-unique genes within a human genome, we performed a statistical analysis of STR distributions from previously collected gene sets. In Table 4, the average occurrence rate of polymorphic STRs in all 225 homologous genes is 0.3216 (Polymorphic STRs/Mbp), which is lower than the rate of 1.7020 (Polymorphic STRs/Mbp) in the 492 human-unique genes. This observation suggests that characteristics of STRs in homologous genes are highly conserved among various species. In other words, highly variable STRs within homologous genes that are conserved across species might affect important genetic functions. In addition, we compared the variation rates of CODIS STR markers, which were higher than the rate for homologous genes but lower than the rate for human-unique genes. We speculate that the polymorphic STR patterns of these 492 human-unique genes should provide more identifiable STR markers than CODIS-selected genes, and that they might not be related to genetic functions for human beings but instead provide distinguishable features for different individuals.

Table 4 Occurrence rates of variation in STRs for 225 homologous genes, 10 CODIS genes (excluding 3 segments), and 492 human-unique genes.

To observe the levels of STR marker variation within homologous genes, we calculated the maximum deviation (Max. Dev.) and average deviation (Avg. Dev.) in base pairs. Max. Dev. and Avg. Dev. are defined in Eq (1) and Eq (2), respectively. Max. Dev. represents the largest difference in repeat length (in bp) of a specified STR within the same gene between any two individuals, and Avg. Dev. is obtained by averaging the length differences (in bp) between corresponding STRs within the same gene over all possible pair combinations among the 7 individuals.

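\[
\mathrm{Max.\,Dev.}_{a} = \operatorname{Max}\,|a_i S_k| - \operatorname{Min}\,|a_j S_k|, \quad i \neq j,\; S_k \in a_i, a_j \text{ for all } k \tag{1}
\]
\[
\mathrm{Avg.\,Dev.} = \frac{\sum_{i,\, i \neq j}^{7} \sum_{j}^{7} \left( |a_i S_k| - |a_j S_k| \right)}{\binom{7}{2}} \tag{2}
\]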

where $|a_i S_k|$ denotes the repeat length of the STR $S_k$ within the selected gene $a$ from the $i$th individual, while $|a_j S_k|$ denotes the corresponding length for the $j$th individual.
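As a worked illustration of Eq (1) and Eq (2), the sketch below computes both deviations for one STR across 7 individuals. It assumes that Avg. Dev. averages absolute length differences over the 21 unordered pairs, since signed differences summed over both orderings would cancel; the function names and the repeat lengths are hypothetical.

```python
from itertools import combinations
from math import comb

def max_dev(lengths_bp):
    """Largest repeat-length difference (bp) between any two individuals."""
    return max(lengths_bp) - min(lengths_bp)

def avg_dev(lengths_bp):
    """Mean absolute length difference (bp) over all unordered pairs.

    Assumption: absolute differences over C(n, 2) pairs, as described
    in the lead-in; this is one reading of Eq (2).
    """
    n_pairs = comb(len(lengths_bp), 2)
    total = sum(abs(a - b) for a, b in combinations(lengths_bp, 2))
    return total / n_pairs

# Hypothetical repeat lengths (bp) of one STR in 7 individuals.
lengths = [20, 20, 24, 24, 28, 20, 24]
print(max_dev(lengths))            # 8
print(round(avg_dev(lengths), 3))  # 3.429
```

Here the pairwise differences sum to 72 bp over 21 pairs, giving an average deviation of about 3.43 bp.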

We found that a total of 477 polymorphic STR patterns were detected in 492 human-unique genes, in which most of the patterns were located within "intron" regions. These results were similar to those for the CODIS STR markers. Additional file 1 lists the sorted STRs according to the Avg. Dev. and Max. Dev. To illustrate the differences in repeat length for each person, we selected two STR patterns with large differences among 7 individuals.

In addition, we selected examples of polymorphic STR patterns with family inheritance relationships from all detected STRs. Two aligned results are shown in Figure 2. It is interesting to observe in Table 1 that the polymorphic STRs from CODIS gene sets were well-conserved in different families and individuals: a total of 8 defined STR biomarkers within 13 genes displayed exactly the same repeat pattern and length, and only one or two polymorphic STRs could be identified between any two individuals. Hence, how to increase distinguishability at different levels becomes an interesting challenge. Here we illustrate two STR examples in Figure 2 that showed variations in polymorphic STRs at different levels; such STR motifs could be further experimentally evaluated and applied to identify different individuals or groups.

Figure 2

Examples of different levels of polymorphic STRs. The STRs were retrieved from ENSG00000267127 and ENSG00000110799 for all 7 human genomes. NA12878 represents the CEU child, NA12891 the CEU father, NA12892 the CEU mother, NA19238 the YRI mother, NA19239 the YRI father, and NA19240 the YRI child. (a) Aligned polymorphic STR patterns and flanking sequences for ENSG00000267127, which is contained in the human-unique gene set. The left red box shows the differences between families, while the right orange box represents inheritance relationships within a family (identical STRs for both mother and daughter). (b) Aligned polymorphic STR patterns and flanking sequences for ENSG00000110799, which is contained within the CODIS gene set. The boxes have the same meaning as in (a), and the segments highlighted with a blue background represent aligned flanking sequences.

According to the STR variations among 7 human genomes, we tried to define 3 distinct types for comparing polymorphic STRs. The first type of polymorphic STR represents a set of suitable STRs for distinguishing each individual, including the query sequences coming from members of the same family. The second type of polymorphic STR demonstrates a set of identified STR biomarkers obeying inheritance and could be applied to different groups. The last type of specific STR provides a set of suggested STRs that reveal characteristics that are identical for the Trio families but different from the other groups. A total of 26 specific STR biomarkers were defined from the identified 477 polymorphic STRs within human-unique genes. Additional file 2 lists all 26 relevant STR biomarkers, of which 17 markers appeared as a type of single nucleotide polymorphism (SNP). All of these 26 STRs demonstrate relatively high potential as distinguishable STR biomarkers at different levels.

Polymorphic STR distributions in chromosomes

Polymorphic STRs identified from each chromosome were analyzed and compared for 7 individuals. The total number of genes is 56,852 in this study, of which 617 were not successfully detected due to serious sequence variations and/or query genes located at defined boundaries. Accordingly, a success rate of 98.91% was achieved for polymorphic STR analyses in this study. In addition, we did not consider STR motifs in the Y chromosome since it belongs exclusively to males. Hence, we only performed a statistical analysis of the STR distribution of both polymorphic and conserved STRs among all acquired genomes. Figures 3a and 3b show the distributions of polymorphic STRs and conserved STRs within each defined chromosome, respectively. In both figures, the x-axis represents the chromosome number, the y-axis represents the number of differences between varied STRs, and the z-axis denotes the accumulated percentages of polymorphic/conserved STRs in each selected chromosome. The highest bars in the last row (shown in light grey) in Figure 3a represent all accumulated percentages of polymorphic STRs for each chromosome, while the highest bars in the last row (shown in green) in Figure 3b represent all accumulated percentages of conserved STRs. The highest percentages of the corresponding bars from two figures should total 100% for each chromosome. For example, the percentage of polymorphic STRs in chromosome 1 is calculated by taking the total number of polymorphic STRs (TNpssr) within chromosome 1 divided by the total number of identified STRs (TNssr) within chromosome 1, and the average percentage of polymorphic STRs in the first chromosome for 7 human genomes is nearly 12.1%. Similarly, the average percentage of conserved STRs for chromosome 1 is obtained by taking the total number of conserved STRs (TNcssr) divided by the total number of identified STRs (TNssr).
After taking the average from 7 human genomes, the percentage of conserved STRs is approximately 87.9% for chromosome 1. In other words, in chromosome 1, the total number of conserved STRs is more than 7-fold greater than the number of polymorphic STRs. It should be noted that the ratio of conserved STRs to polymorphic STRs is quite consistent for each chromosome, and the average fold change for all the different chromosomes is about 6.68.
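The arithmetic behind these figures can be checked with a short sketch. The gene counts (56,852 and 617) and the chromosome 1 percentages come from the text; the STR counts passed to the function (121 polymorphic out of 1000) are hypothetical values chosen only to reproduce the reported 12.1%.

```python
def success_rate_pct(total_genes, failed_genes):
    """Percentage of genes that were analyzed successfully."""
    return 100.0 * (total_genes - failed_genes) / total_genes

def str_percentages(tn_pssr, tn_ssr):
    """Return (polymorphic %, conserved %, conserved-to-polymorphic fold)."""
    polymorphic = 100.0 * tn_pssr / tn_ssr
    conserved = 100.0 - polymorphic
    return polymorphic, conserved, conserved / polymorphic

print(round(success_rate_pct(56852, 617), 2))  # 98.91

# Hypothetical counts giving the reported 12.1% for chromosome 1.
poly, cons, fold = str_percentages(121, 1000)
print(round(poly, 1), round(cons, 1), round(fold, 2))  # 12.1 87.9 7.26
```

The fold value of about 7.26 agrees with the statement that conserved STRs outnumber polymorphic STRs by more than 7-fold on chromosome 1.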

Figure 3

Average percentages of polymorphic and conserved STRs for the selected 7 human genomes. (a) Average polymorphic STR distribution in each chromosome. The x-axis represents the chromosome number; the y-axis represents the number of differences between varied STRs (bp) ranging from 20 to 84 bp; the z-axis represents the accumulated percentage for each chromosome. The accumulated percentage of varied STRs was obtained by summing up average percentages from high to low variations for each chromosome. (b) Average conserved STR distribution in each chromosome. The x-axis represents the chromosome number; the y-axis represents the length of conserved STRs (bp) ranging from 20 to 84 bp; the z-axis represents the accumulated percentage for each chromosome. The accumulated percentage of conserved STRs was obtained by summing up average percentages from high to low conserved lengths for each chromosome.

Furthermore, we also evaluated the total length of STRs (TLSTR), total length of selected genes (TLgene), total number of genes (TNgene), total number of STRs (TNSTR), total number of polymorphic STRs (TNpSTR), density of polymorphic STRs, and occurrence ratio of polymorphic STRs in each chromosome. These data are summarized in Table 5, which shows that the highest densities of polymorphic STRs were found on chromosomes 19 and 20 (with 0.921 and 0.780 polymorphic STRs per Mbp, respectively), and the lowest density was observed on chromosome 3 (with 0.375 polymorphic STRs per Mbp). It should be noted that the occurrence ratio of polymorphic STRs in each chromosome is distributed evenly within the range from 11.57% to 14.73%. However, the distributions of the number of STRs, the gene number, and the gene length on each chromosome show that the number of STRs does not simply scale with either gene number or gene length. For example, the total numbers of STRs retrieved from chromosomes 19 and 7 are 23255 and 24975, respectively, but the total numbers of genes are 2901 and 2792, respectively. As another example, the total numbers of STRs retrieved from chromosomes 19 and 8 are 23255 and 19247, but the total gene lengths are 3074.71 Mbp and 5590.48 Mbp, respectively. Greater gene lengths or higher numbers of genes therefore do not necessarily imply a larger number of repeat segments.
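The relationship between these quantities can be sketched as follows. The sketch assumes that density is TNpSTR divided by TLgene (in Mbp) and that the occurrence ratio is TNpSTR divided by TNSTR; the exact definitions used in Table 5 are not given in the text, so both are assumptions. The inputs are the chromosome 19 values quoted above, taken at face value.

```python
def polymorphic_str_stats(density_per_mbp, gene_length_mbp, total_strs):
    """Back-calculate the polymorphic STR count and its occurrence ratio.

    Assumes density = TNpSTR / TLgene and ratio = TNpSTR / TNSTR.
    """
    implied_count = density_per_mbp * gene_length_mbp
    occurrence_pct = 100.0 * implied_count / total_strs
    return implied_count, occurrence_pct

# Chromosome 19 values quoted in the text.
count, pct = polymorphic_str_stats(0.921, 3074.71, 23255)
print(round(count), round(pct, 2))  # 2832 12.18
```

Under these assumptions the implied occurrence ratio for chromosome 19 is about 12.2%, which falls inside the reported range of 11.57% to 14.73%.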

Table 5 Comprehensive STR statistics for all chromosomes, sorted by polymorphic STR density.

Alternatively, highly variable STR patterns among 7 human genomes can be determined by assessing the extent of STR variations using a Manhattan-like scatter plot for all human chromosomes. The quality setting for all identified STR patterns is defined as 1.0 for this plot. Through the Manhattan plot (Figure 4), several polymorphic STR motifs exhibiting very high variation were readily apparent, and these extremely varied cases could be considered as the first choice for STR biomarker candidates. If a higher normalization threshold value for variation were assigned, fewer polymorphic STR biomarker candidates would be retrieved from the plot. For example, when the threshold value of variation was set to "6", the system replied with 5 important polymorphic STR candidates. These selected STR candidates are located within ENSG00000187627, ENSG00000233673, ENSG00000142453, ENSG00000154654, and ENSG00000029993 on chromosomes 2, 2, 19, 21, and X, respectively.

Figure 4

A Manhattan-like scatter plot of all polymorphic STRs across the chromosomes of the human genome. The x-axis represents genomic coordinates of the chromosomes in sequential order. The y-axis includes normalized upper and lower bounds of varied repeat number among 7 individuals (represented by "-" in two different colors). The upper/lower bound is calculated by multiplying the upper/lower quartile of the repeat number by +/-1.25 and normalizing by the median value among the 7 individuals. The y-axis also includes normalized maximum and minimum varied repeat numbers among 7 individuals (represented by the symbol "x" in two different colors). At a threshold value of "6", 5 polymorphic STR patterns (circled symbols) were considered as important biomarker candidates in this example.

The ISP online web system

We designed a comprehensive web-based system called ISP for efficiently identifying polymorphic STRs among different individuals. Several useful functions were designed for users to retrieve and verify all potentially important STR biomarkers and compare personal STRs to 7 published human genomes. Users can enter an Ensembl gene ID, gene descriptions, gene names, or any related keywords, and the system immediately responds with query results for the appropriate gene selection. Users can then select a gene of interest, and a pop-up dialog for STR quality and STR variation settings is displayed on the web page. For real-time analysis, only two quality values of 1.0 and 0.9 are currently available, and variation degrees are automatically determined and unlocked for selection depending on the selected genes. A quality of 1.0 indicates that all identified STRs are perfect STRs, while a quality of 0.9 indicates that an identified STR contains less than 10% noise including mutations, insertions, and deletions. Variation degree is calculated as the true difference in base pairs between any two polymorphic STRs.
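The two settings can be illustrated with a simplified sketch. It scores an observed repeat against a perfect repeat of the same unit by counting substitutions only; insertions and deletions, which the system also counts as noise, would require an alignment and are ignored here. The function names and the example sequence are hypothetical.

```python
def mismatch_quality(observed, unit):
    """Fraction of positions agreeing with a perfect repeat of `unit`.

    Simplification: substitutions only; indels are not modelled.
    """
    if not observed or not unit:
        raise ValueError("observed and unit must be non-empty")
    # Build a perfect template of the same length as the observed STR.
    template = (unit * (len(observed) // len(unit) + 1))[:len(observed)]
    mismatches = sum(1 for a, b in zip(observed, template) if a != b)
    return 1.0 - mismatches / len(observed)

def variation_degree(len_a_bp, len_b_bp):
    """Difference in base pairs between two STR lengths."""
    return abs(len_a_bp - len_b_bp)

# Hypothetical 20 bp repeat with one substitution in the third unit.
print(round(mismatch_quality("ACAGACAGACTGACAGACAG", "ACAG"), 2))  # 0.95
print(variation_degree(20, 28))                                     # 8
```

Under this simplified measure the example repeat would pass a quality setting of 0.9 but not 1.0.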

In the proposed system, users can provide customized sequences for STR polymorphism analysis. Once the query sequences are uploaded, the system applies BLAST+ to align each query sequence against the reference human genome. Once the query sequence is successfully aligned to one of the collected human genes, the newly identified STRs within the query sequence are compared to all 7 human genomes for polymorphic STR analyses. The query results are presented in the same way as described above. Here, the threshold for identity in BLAST+ was set at 99%. Such a relatively high threshold value avoids ambiguous situations caused by non-human sequences. Finally, the compared results are displayed via a tabulated interface and sent via email. For security reasons, the URL was designed with embedded encryption.
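The 99% identity cutoff can be illustrated by filtering BLAST+ tabular output. The text does not state which output format the system uses; this sketch assumes the standard tab-separated format (-outfmt 6), whose first three columns are query ID, subject ID, and percent identity. The function name and the sample hits are hypothetical.

```python
def filter_hits(tabular_text, min_identity=99.0):
    """Keep (query, subject, identity) for hits at or above min_identity.

    Assumes BLAST+ tabular output: qseqid, sseqid, pident, ...
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid, pident = fields[0], fields[1], float(fields[2])
        if pident >= min_identity:
            hits.append((qseqid, sseqid, pident))
    return hits

# Hypothetical tabular output for one uploaded contig.
sample = (
    "contig_1\tENSG00000110799\t99.62\t1320\t5\t0\t1\t1320\t210\t1529\t0.0\t2400\n"
    "contig_1\tENSG00000154654\t91.20\t500\t44\t0\t1\t500\t10\t509\t1e-150\t540\n"
)
print(filter_hits(sample))  # [('contig_1', 'ENSG00000110799', 99.62)]
```

Hits below the cutoff are simply dropped, so a query that matches no collected human gene at 99% identity or above yields an empty list.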

The system also includes four test gene sets: disease-related genes, CODIS genes, homologous genes, and genes related to the GO term GO:0001501. Corresponding statistical reports stored in Microsoft Excel files are provided in the developed system. For online queries on genes of interest, users can click on the folder "ISP Datasets", where the four gene sets and their corresponding identified polymorphic STRs are available for each individual gene. To comprehensively analyze polymorphic STRs for all human genes, the folder "Chromosome Statistics" provides 23 Excel files, each of which contains the total number of STRs, total number of polymorphic STRs, total length of selected genes, total length of STRs, percentages of exact genetic locations of all detected STRs, percentages of different variation degrees for all polymorphic STRs, and two different degrees of STR quality (perfect STR and imperfect STR with less than 10% noise content). All these statistics can be downloaded directly from the interface. One example of the polymorphic STR distributions on chromosome 6 with perfect STR quality settings is shown in Figure 5. Comparing the yellow bars in the last row shows that polymorphic mono-nucleotide STR motifs occur in the highest number of genes, while tri-nucleotide STR motifs account for the lowest percentage of genes. Polymorphic STRs located within the coding regions (the fifth position in each distinct fundamental pattern length of STR) exhibit the lowest rates, since variations appearing within translated proteins might lead to different protein structures and induce deleterious effects on protein function. The longest variation type of STR among the 7 human genomes is the di-nucleotide STR motif, which occurs within the intron regions of chromosome 6. Statistics for all chromosomes with different quality settings may be downloaded directly from the developed web site.

Figure 5

An example of distribution profiles of polymorphic STRs on chromosome 6. x-axis represents 6 different lengths of STRs (from mono- to hexa-nucleotide repeat units) located within 6 different genetic regions of polymorphic STRs; y-axis represents the accumulated number of base pairs for STR variations (from SNP to 84 bp variation); z-axis represents the total number of genes having polymorphic STRs.

To comprehensively display the identified polymorphic STRs and provide detailed information on selected genes, the system has a look-up table. In this table, users can easily find detailed descriptions of the selected gene and identified STRs. This web page includes Ensembl gene ID, gene name, pattern(s) of polymorphic STRs, transcript ID(s), and STR locations within the corresponding chromosome. In addition, the system also provides sequence files for two assembled families and reference sequences from Ensembl. Because of alternative splicing mechanisms in genomes, genetic regions of identified STR patterns might be affected and result in different conclusions for different transcripts. To observe all possible scenarios, the system presents all polymorphic STRs according to transcript ID. Users can click on any transcript ID and the identified results are immediately shown on the web page. To rapidly identify polymorphic STR patterns, users can click on a detected polymorphic pattern within the gene information table to display a corresponding message that is annotated with the locations framed in red. To display global sequence alignments of the identified STRs among the 7 individuals, clicking on the identified STR pattern or "Alignment Result" automatically displays the alignment results. Through these alignment procedures, users can verify and understand the polymorphic distribution of STRs among sample genomes. The multiple sequence alignments are generated in the system by ClustalW [32].

Conclusions

In this study, an automated workflow for discovering STR polymorphisms from individual NGS sequencing data was proposed and the developed system is freely available at http://isp.cs.ntou.edu.tw/. The proposed algorithms start by performing reference mapping or de novo assembly of the imported NGS sequences, and the coordinate calibration is defined by mapping onto the Ensembl reference human genome. An integrated STR template profile was initially created to overcome the insertions and deletions that occurred in the reference genome or other target genomes. All possible polymorphic STR patterns could be detected automatically and precisely according to the aligned coordinate system. In this paper, polymorphic STRs from several different gene sets were applied as evaluation datasets to demonstrate the proof-of-concept, including the gene set selected by CODIS, the gene set associated with diseases caused by STR variations, the cross-species homologous gene set, and a human-unique gene set. In addition, all STR polymorphisms that were found within the 1000 Genomes Trio Project (6 genomes) were comprehensively identified and are downloadable from the designed website. We also performed statistical analyses on both polymorphic and conserved STRs in each chromosome (except the Y chromosome), and occurrence frequencies for polymorphic STR variations between cross-species homologous genes and human-unique genes were compared to investigate the relationships between functional features and identifiable features of STR biomarkers. The results showed that STR variation frequencies for human-unique genes were clearly higher than those for cross-species conserved homologous genes, despite both gene sets exhibiting similar STR distributions and densities.
This result implies that mutations of STR elements tend not to appear within genes that are highly conserved among different organisms during evolutionary processes, and that these cross-species conserved STRs could be considered more likely to be functionally related. Conversely, the polymorphic STRs that appeared within human-unique genes could be regarded as good candidates for identifiable biometric features. Focusing on the selected 477 polymorphic STRs from human-unique genes, three different categories were logically analyzed and suggested according to the 7 human genomes (considered as 3 different family groups and 7 individuals). Interestingly, we found some STR variation characteristics from human-unique genes possessing distinguishable features that could support CODIS STR verification. Furthermore, from genome-wide analysis and selection, we found a set of 26 polymorphic STRs retrieved from human-unique genes that displayed relatively higher distinguishability compared to other identified STRs. In order to understand the distributions of polymorphic STRs within each chromosome (except the Y chromosome), we compared densities of polymorphic STRs within each chromosome, and the results show that chromosome 19 had the highest density of polymorphic STRs, while chromosome 3 had the lowest density. The developed system has shown that our proposed methods can detect polymorphic STR markers efficiently, taking advantage of NGS high-throughput sequencing technology and detecting polymorphic STRs without manual curation and comparison. In order to efficiently provide a clear view of query results for polymorphic STRs for each gene, we have pre-processed all genes within all chromosomes (except the Y chromosome). Users will be able to perform customized sequence comparisons online for identifying all polymorphic STRs within a specified gene.
In addition, users can upload their own query sequences to compare STR variations with 7 human genomes. We believe that the developed system can facilitate research involving the detection of novel STR biomarkers and the discovery of regulatory STR elements.

Methods

The 1000 Genomes Project

To demonstrate that the proposed method is capable of detecting STR polymorphisms from NGS data, we downloaded NGS genomic data from the 1000 Genomes Project as benchmark datasets. The 1000 Genomes Project is an ongoing international research project whose goal is to provide population-scale, high-coverage sequencing data worldwide. In 2010, the project completed its first phase, which included 3 pilot projects: the Low Coverage Project, providing low-coverage whole-genome sequencing data from 179 people; the Exon Project, providing high-coverage sequencing data from 697 people, restricted to exonic regions of 906 randomly selected genes; and the Trio Project, providing whole-genome, high-coverage sequencing data from two families in different populations [33]. In the Trio Project, each family comprised 3 persons: father, mother, and daughter. Its high-coverage, whole-genome sequencing data made the Trio Project an ideal sample resource for identifying various STR polymorphisms. The 1000 Genomes Trio Project files were downloaded from the NCBI FTP site in binary alignment map (BAM) format, which is a de facto standard format for representing reference mapping results [34]. Because the files retrieved from NCBI had already been mapped to the standard human genome sequences, the first step in our proposed method could be omitted. Instead, we applied SAMtools to transform the binary BAM format into the plain-text SAM format, and applied the mpileup tool bundled with SAMtools to generate the consensus sequences for each individual in the Trio Project.

Ensembl Dataset

The human genome sequences of GRCh37 from the Ensembl FTP site were also downloaded as references, and Ensembl gene annotations from BioMart interfaces were retrieved to verify genetic locations of STR motifs [35]. In the developed system, the Ensembl human genome database from version 73 was applied for analysis. A total of 56235 genes were annotated, analyzed, and compared in this study.

CODIS markers

To verify the proposed method using previously published polymorphic STR motifs, we applied the well-known STR markers from the Combined DNA Index System (CODIS). CODIS is a criminal forensic DNA database constructed by the U.S. Federal Bureau of Investigation (FBI), and it listed 13 highly polymorphic STR markers [5]. Each defined polymorphic STR marker within the collected 7 individual human genomes was retrieved and compared at different levels of specificity.

Disease-related STR markers

None of the STR markers collected in CODIS is linked to gene regulation networks or genetic diseases. However, several STR variations have been verified as crucial factors in causing lethal diseases, and many of the identified STRs play important roles as regulatory elements that affect gene expression. Although no individual medical records are available for the acquired Trio samples, these verified STRs were detectable and could be used to determine whether polymorphisms of known disease-related STR motifs occur among different individuals in the Trio Project.

Homologous genes and human-unique genes

The quantity and quality of homologous genes provide powerful evidence for analyzing evolutionary relationships between two queried species, and investigation of STR conservation across different species has facilitated the discovery of functional STR motifs. Hence, we collected well-defined homologous genes belonging to human, cow, dog, zebrafish, stickleback, macaque, mouse, medaka, tetraodon, and fugu as one of our experimental datasets. Through sequence alignment analysis and annotations from Ensembl, a total of 225 genes exhibiting orthologous relationships were collected, and these genes were used to analyze STR polymorphisms within the 7 human genomes. In contrast to the homologous gene analysis, we also collected human-unique genes by comparing all possible homologous relationships between the human genome and that of its closest relative, the chimpanzee. A total of 492 human-unique genes were collected for polymorphism analysis in this study. STRs identified from human-unique genes that were polymorphic among the 7 human genomes were considered potential candidates for STR biomarkers. To ensure the uniqueness of the collected genes, we further verified them against five mammalian species: gorilla, chimpanzee, macaque, orangutan, and mouse.

System Flowchart

An overview of the configuration of the proposed method is shown in Figure 6. Initially, the sequenced NGS genome datasets from different individual samples were provided as input data. There are 6 major steps that were designed for automated detection of polymorphic STRs. (1) Short reads were converted into consensus sequences in order to reduce computational complexity. There were two different standard processes for assembling NGS raw reads including reference mapping and de novo assembly approaches. Either approach or a combination of the two methods could be applied, depending on the target species for referencing sequences. (2) After extracting the consensus sequences from assembled contigs, each individual sequence was bias corrected, and its corresponding upstream and downstream segments were extracted. (3) Traditional in silico STR detection was performed on both reference and target individual consensus sequences to generate individual STR profiles. (4) Each individual sequence was aligned to a selected reference sequence in order to recognize and calibrate all corresponding locations of STR candidates. (5) Once a unified STR template profile could be constructed according to all previous STR profiles generated from imported NGS datasets, all potential polymorphic STRs were identified by automatically comparing the defined STR template profile against each individual target STR profile. (6) At the final step, a checking procedure was performed for evaluating overlapped and/or mis-recognized cases during STR retrieval processes under various parameter settings. The system was designed to include these overlapped candidates according to defined genetic locations and adjust the settings of retrieving modules (CGSSR [12]) in order to mine all possible STR patterns. The processes for each system module are described in further detail in the following sections.

Figure 6
Configuration of the proposed online system. There are 6 major steps (represented in 6 different colors) in the designed workflow including the assembling of sequencing reads, sequence calibration, STR pattern extraction, mapping with the target reference, STR template profile construction, STR polymorphism detection, and merging neighboring STR segments.

Extract consensus sequence and reference assignment

NGS datasets are usually composed of a large number of short reads accompanied by sequence quality information. Read lengths usually range from tens to hundreds of base pairs, depending on the NGS machine and protocol. Since the exact location of each read is unknown, an assembly process that reconstructs the correct gene sequences from these fragmented short reads is an essential step prior to genomic analysis. There are two main types of reconstruction methods, suited to different circumstances. If the genome of the query organism has been sequenced and published previously, a reference mapping approach can be applied to assemble the sequence reads. Short reads are aligned to known reference sequences, and differences between reference sequences and query reads are annotated. This approach is usually applied when sequencing model species and in medically related studies. On the other hand, if no reliable reference genome is available for the target organism, a de novo assembly approach should be applied to the sequenced short reads. A de novo assembly algorithm reconstructs the original sequence using read contents only, which usually requires more computational resources. Many tools are publicly available for both reference mapping and de novo assembly [36].

At the first stage of our proposed workflow, sequences were reconstructed in a manner that depended on their origin. The intermediate output at this stage was a set of consensus sequences in standard FASTA format, extracted from mapping results or obtained from de novo assembly tools. After extraction of the consensus sequences, a reference sequence was assigned as the central representation prior to subsequent mapping processes. For results obtained from reference mapping, the reference sequences used for mapping could be applied automatically. For results obtained from de novo assembly, however, the reference sequence was picked from among the individual sequencing results; in general, a standard quality indicator such as N50 or average contig length could be applied for this selection. In order to compare upstream and downstream regions of target genes, we additionally collected 7,500 bp from either side of each gene.
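As an illustration of reference selection by N50, the following minimal Python sketch computes the N50 of each de novo assembly and picks the sample with the highest value. The function names and the input layout (a mapping from sample name to contig lengths) are assumptions made for this example and are not part of the published system.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def pick_reference(assemblies):
    """Pick the sample whose assembly has the highest N50.

    `assemblies` is a hypothetical mapping: sample name -> contig lengths.
    Ties are resolved by dictionary order.
    """
    return max(assemblies, key=lambda name: n50(assemblies[name]))
```

For instance, `pick_reference({"sample_a": [100, 100], "sample_b": [500, 10]})` selects `"sample_b"`, because a single long contig dominates its assembly. Average contig length could be substituted as the key function in the same way.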

Sequence calibration and upstream/downstream segment annotation

Although the gene sequences of different individuals were highly similar to each other, coordinates of assembled sequences could not be directly applied across sequencing datasets. This was mainly because random insertions or deletions caused by evolution, mutation, or reconstruction errors occurred in the genomic sequences. Hence, we employed NCBI BLAST+ programs to perform quick searches that identify the approximate locations of the target samples [37]. Our purpose in this module was to align and correctly define both upstream and downstream segments of 7,000 bp in length for each assembled sequence. Two extra 500 bp segments, one at the head of the upstream segment and one at the tail of the downstream segment, were added to the reference sequence to serve as key anchors for matching all query assembled sequences and for calibrating sequence biases. After calibration, these 500 bp extensions were discarded from all sequences. As a result, each query sequence contained aligned upstream and downstream flanking segments on both sides.
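The window arithmetic described above (7,000 bp flanks plus 500 bp anchors, 7,500 bp in total on each side) can be sketched as follows. This is only an illustration under assumed conventions: coordinates are 0-based and half-open, the function names are hypothetical, and the BLAST+ matching step itself is not shown.

```python
FLANK_BP = 7000   # flanking segment kept on each side of the gene (from the text)
ANCHOR_BP = 500   # extra anchor segment, discarded after calibration (from the text)


def extended_window(gene_start, gene_end, chrom_length,
                    flank=FLANK_BP, anchor=ANCHOR_BP):
    """Return (start, end) of gene + flanks + anchors, clipped to the chromosome.

    Coordinates are 0-based, half-open (an assumption of this sketch).
    """
    start = max(0, gene_start - flank - anchor)
    end = min(chrom_length, gene_end + flank + anchor)
    return start, end


def trim_anchors(sequence, anchor=ANCHOR_BP):
    """Drop the anchor segments from both ends of a calibrated sequence.

    Assumes the window was not clipped at a chromosome end; a clipped
    window would need its actual anchor lengths tracked separately.
    """
    if len(sequence) <= 2 * anchor:
        return ""
    return sequence[anchor:len(sequence) - anchor]
```

A gene at positions 10,000 to 12,000 would thus be extracted as the window 2,500 to 19,500, and the outer 500 bp on each side would be removed again once the sequences had been anchored.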

STR discovery

There are several different tools available for retrieving in silico STRs [11]. An ideal tool for detecting polymorphic STR markers should support STR detection while tolerating insertions, deletions, and substitutions. In this study, we adopted CGSSR as the STR retrieval tool. CGSSR is an STR discovery tool based on autocorrelation analysis, and its kernel algorithm supports all three types of tolerance [12]. STR motifs retrieved from each individual sequence could be mapped to coordinates on the reference by using the globally aligned results generated in subsequent steps. In this study, STRs were obtained from CGSSR with two tolerance rates, 90% (imperfect) and 100% (perfect), and a minimum repeat length of 20 bp.
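To make the "perfect" (100%) setting concrete, the sketch below finds exact tandem repeats with unit lengths of 1 to 6 bp and a total length of at least 20 bp. It is not CGSSR and does not use autocorrelation; it is a simple regular-expression stand-in that handles neither the 90% imperfect setting nor ambiguous bases, and the function name and record layout are assumptions of this example.

```python
import re


def find_perfect_strs(sequence, min_length=20, max_unit=6):
    """Return sorted (start, end, motif, repeats) tuples for perfect tandem repeats.

    Coordinates are 0-based, half-open. Only A/C/G/T are considered.
    """
    sequence = sequence.upper()
    hits = []
    for unit in range(1, max_unit + 1):
        # Smallest repeat count whose total length reaches min_length.
        min_repeats = max(2, -(-min_length // unit))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit,
            # e.g. "ATAT" is reported once, as "AT".
            if any(unit % k == 0 and motif == motif[:k] * (unit // k)
                   for k in range(1, unit)):
                continue
            repeats = (match.end() - match.start()) // unit
            hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)
```

Calling `find_perfect_strs("GGC" + "AT" * 12 + "CCG")` reports a single 24 bp dinucleotide repeat, while a run such as `"AT" * 5` falls below the 20 bp minimum and is ignored.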

Mapping individual coordinates to the reference

Because gene lengths vary among individuals, as mentioned in the previous section, the locations recorded in an STR profile may deviate from one another. This location bias may cause template-building procedures to fail; thus, all corresponding STR segments among different individuals should be identified through an appropriate approach. The coordinates of each sequence were therefore calibrated in advance against the assigned reference sequence through a global pairwise alignment. In this study, we applied the EMBOSS stretcher program to perform global alignment between the reference sequence and each individual target sequence. The aligned results were then employed as the data resource for coordinate transformation [38]. Each discovered STR record within an individual sequence was annotated with its corresponding location in the reference gene sequence, its repeat motif pattern, and its number of repeats. The collection of all mapped-coordinate STR records was finally defined as an STR template profile.
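The coordinate transformation can be illustrated with a small sketch that walks one global pairwise alignment (two gapped strings of equal length, as produced by a global aligner) and records, for every base of the individual sequence, the matching reference position. The function names and the tuple layout are hypothetical, and mapping inserted bases to the preceding reference position is a simplifying choice of this example.

```python
def target_to_reference_map(aligned_ref, aligned_target, gap="-"):
    """Map each target position to a reference position (both 0-based).

    Target bases inserted relative to the reference are mapped to the
    most recent reference position (0 if none has been seen yet).
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    mapping = []
    ref_pos = -1  # index of the last reference base consumed
    for ref_char, tgt_char in zip(aligned_ref, aligned_target):
        if ref_char != gap:
            ref_pos += 1
        if tgt_char != gap:
            mapping.append(max(ref_pos, 0))
    return mapping


def lift_str(record, mapping):
    """Translate a (start, end, motif, repeats) record, given in half-open
    target coordinates, into reference coordinates."""
    start, end, motif, repeats = record
    return (mapping[start], mapping[end - 1] + 1, motif, repeats)
```

With the toy alignment `ACG-TAC` (reference) against `A-GGTAC` (target), the map is `[0, 2, 2, 3, 4, 5]`, so an STR record found at target positions 3 to 6 is lifted to reference positions 3 to 6.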

STR template and polymorphic STR construction

Since an STR profile contained all STR motifs retrieved from an individual genome under an identical coordinate system, STR polymorphisms could then be observed by comparing it with all the remaining STR profiles. To list all polymorphic STR candidates efficiently, a representative and comprehensive STR template was built by taking the union of the reference profile and all other individual profiles. It should be noted that all STR patterns were compared under rotational tolerance, because the basic STR pattern might be shifted as a result of point mutations or insertion/deletion polymorphisms. Once the template profile was constructed, polymorphic STRs could be identified easily by comparing all STR records within the accumulated template profile against each individual STR profile. Since all STR coordinates were aligned to the reference sequence, the known gene annotations of the reference gene could be applied to each individual STR motif to assign appropriate genetic location information. After a comprehensive and annotated STR profile had been constructed for each individual, only the existence and the repeat number of a specific STR pattern at a corresponding location needed to be checked, and therefore the system could respond to a query in real time and verify all polymorphic STRs.
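The template construction and comparison can be sketched as below. Rotational tolerance is approximated by reducing each motif to its lexicographically smallest rotation, the template is the union of all profile entries, and a locus is reported as polymorphic when repeat counts differ between samples (absence counts as 0). The data layout and function names are assumptions for this example; in particular, keying loci on the exact reference start is a simplification, since a rotated motif may shift the start by a few bases.

```python
def canonical_motif(motif):
    """Return the lexicographically smallest rotation of a motif,
    so that e.g. "GATA" and "AGAT" compare as equal."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def build_template(profiles):
    """Union of (reference_start, canonical motif) keys over all profiles.

    `profiles` maps a sample name to a list of (ref_start, motif, repeats).
    """
    template = set()
    for records in profiles.values():
        for ref_start, motif, _ in records:
            template.add((ref_start, canonical_motif(motif)))
    return template


def polymorphic_strs(profiles):
    """Return {locus: {sample: repeats}} for loci whose repeat count varies.

    A sample lacking the STR at that locus is given a count of 0.
    """
    lookup = {
        name: {(s, canonical_motif(m)): r for s, m, r in records}
        for name, records in profiles.items()
    }
    result = {}
    for key in sorted(build_template(profiles)):
        counts = {name: lookup[name].get(key, 0) for name in profiles}
        if len(set(counts.values())) > 1:
            result[key] = counts
    return result
```

Because each per-sample lookup is a dictionary keyed by locus, checking the existence and repeat number of one STR is a constant-time operation, which is consistent with the real-time query behavior described above.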

Merging neighboring STR segments

Due to mutations and gap noise within a repetitive DNA sequence, a polymorphic STR could be erroneously divided into several segments, which caused statistical errors during cross-sample comparison. To avoid such errors, the system provides a merging function for neighboring segments according to their patterns and overlap conditions. The merging module reunites disconnected STR segments into one motif according to previously defined coordinate information. Another potential problem is N/A nucleotides; these require adjusting one of the parameters in CGSSR to find shorter STR patterns that might not have been found in previous steps. Through this mechanism, a comprehensive STR profile for each gene could be successfully constructed.
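A minimal sketch of such a merging step is given below. Segments are merged when they carry the same motif (up to rotation) and either overlap or lie within a small gap of each other. The `max_gap` threshold, the record layout, and the function names are hypothetical choices for illustration; the criteria used by the actual system may differ.

```python
def _rotation_key(motif):
    """Smallest rotation of a motif, used to compare motifs up to rotation."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def merge_neighbors(records, max_gap=0):
    """Merge STR segments with the same motif (up to rotation) that overlap
    or lie within `max_gap` bp of each other.

    `records` is a list of (start, end, motif) in half-open reference
    coordinates; merged segments carry the canonical (rotated) motif.
    """
    merged = []
    ordered = sorted(records, key=lambda r: (_rotation_key(r[2]), r[0], r[1]))
    for start, end, motif in ordered:
        key = _rotation_key(motif)
        if merged and merged[-1][2] == key and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), key)
        else:
            merged.append((start, end, key))
    return sorted(merged)
```

With `max_gap=3`, two AT-type segments at 100 to 124 and 126 to 150 that were split by a 2 bp interruption are reunited into one segment spanning 100 to 150, while a distant segment with the same motif stays separate.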

Function of comparing customized DNA sequences

To design an integrated system for customized services, the system provides users the ability to upload their own gene sequences and discover all polymorphic STRs against the benchmark human genomes. Once a customized sequence is uploaded, the designed system automatically blasts the query gene sequence against these genomes to identify its corresponding gene. The query sequence is then scanned to detect all STR motifs, and their corresponding STR profiles will be created according to previously introduced modules. The online system is freely available at http://isp.cs.ntou.edu.tw/.

Additional material

Additional file 1: Supplementary Document 1. A table of 477 polymorphic STR patterns retrieved from 492 human-unique genes. All related genetic information for each identified STR is described in detail.

Additional file 2: Supplementary Document 2. A table of 26 STRs selected from 477 polymorphic STRs based on specific conditions. All STRs were clustered into three different groups according to individual, family, and ethnic relationships.

Abbreviations

STR: Short Tandem Repeat

pSTR: Polymorphic STR

TLgene: Total length of gene

TLssr: Total length of SSR

TNgene: Total number of gene

TNpssr: Total number of polymorphism SSR

TNssr: Total number of SSR

UTR: Untranslated Region

Chr/Chrom: Chromosome

BLAST: Basic Local Alignment Search Tool

Mbp: Mega base pairs

NCBI: National Center for Biotechnology Information

ENBL: Ensembl

CODIS: Combined DNA Index System

NGS: Next Generation Sequencing

BAM: Binary Alignment Map

SAM: Sequence Alignment Map

References

  1. Jurka J, Pethiyagoda C: Simple repetitive DNA sequences from primates: compilation and analysis. Journal of molecular evolution. 1995, 40 (2): 120-126. 10.1007/BF00167107.

  2. Li YC, Korol AB, Fahima T, Nevo E: Microsatellites within genes: structure, function, and evolution. Molecular biology and evolution. 2004, 21 (6): 991-1007. 10.1093/molbev/msh073.

  3. Ellegren H: Microsatellites: simple sequences with complex evolution. Nature reviews Genetics. 2004, 5 (6): 435-445. 10.1038/nrg1348.

  4. Schlotterer C: The evolution of molecular markers--just a matter of fashion?. Nature reviews Genetics. 2004, 5 (1): 63-69. 10.1038/nrg1249.

  5. Budowle B, Shea B, Niezgoda S, Chakraborty R: CODIS STR loci data from 41 sample populations. Journal of forensic sciences. 2001, 46 (3): 453-489.

  6. Balloux F, Lugon-Moulin N: The estimation of population differentiation with microsatellite markers. Molecular ecology. 2002, 11 (2): 155-165. 10.1046/j.0962-1083.2001.01436.x.

  7. Andrew SE, Goldberg YP, Kremer B, Telenius H, Theilmann J, Adam S, Starr E, Squitieri F, Lin B, Kalchman MA, et al: The relationship between trinucleotide (CAG) repeat length and clinical features of Huntington's disease. Nature genetics. 1993, 4 (4): 398-403. 10.1038/ng0893-398.

  8. Manto MU: The wide spectrum of spinocerebellar ataxias (SCAs). Cerebellum. 2005, 4 (1): 2-6. 10.1080/14734220510007914.

  9. Toutenhoofd SL, Garcia F, Zacharias DA, Wilson RA, Strehler EE: Minimum CAG repeat in the human calmodulin-1 gene 5' untranslated region is required for full expression. Biochimica et biophysica acta. 1998, 1398 (3): 315-320. 10.1016/S0167-4781(98)00056-6.

  10. Lovin DD, Washington KO, deBruyn B, Hemme RR, Mori A, Epstein SR, Harker BW, Streit TG, Severson DW: Genome-based polymorphic microsatellite development and validation in the mosquito Aedes aegypti and application to population genetics in Haiti. BMC genomics. 2009, 10: 590. 10.1186/1471-2164-10-590.

  11. Merkel A, Gemmell N: Detecting short tandem repeats from genome data: opening the software black box. Briefings in bioinformatics. 2008, 9 (5): 355-366. 10.1093/bib/bbn028.

  12. Chen C, Chen C, Shih T, Pai T, Hu C, Tzou W: Efficient algorithms for identifying orthologous simple sequence repeats of disease genes. J Syst Sci Complex. 2010, 23 (5): 906-916. 10.1007/s11424-010-0203-2.

  13. Mardis ER: The impact of next-generation sequencing technology on genetics. Trends in genetics : TIG. 2008, 24 (3): 133-141. 10.1016/j.tig.2007.12.007.

  14. Santure AW, Gratten J, Mossman JA, Sheldon BC, Slate J: Characterisation of the transcriptome of a wild great tit Parus major population by next generation sequencing. BMC genomics. 2011, 12: 283. 10.1186/1471-2164-12-283.

  15. Yu JN, Won C, Jun J, Lim Y, Kwak M: Fast and cost-effective mining of microsatellite markers using NGS technology: an example of a Korean water deer Hydropotes inermis argyropus. PloS one. 2011, 6 (11): e26933. 10.1371/journal.pone.0026933.

  16. Meglecz E, Costedoat C, Dubut V, Gilles A, Malausa T, Pech N, Martin JF: QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics. 2010, 26 (3): 403-404. 10.1093/bioinformatics/btp670.

  17. Hoffman JI, Nichols HJ: A novel approach for mining polymorphic microsatellite markers in silico. PloS one. 2011, 6 (8): e23283. 10.1371/journal.pone.0023283.

  18. Gymrek M, Golan D, Rosset S, Erlich Y: lobSTR: A short tandem repeat profiler for personal genomes. Genome research. 2012, 22 (6): 1154-1162. 10.1101/gr.135780.111.

  19. Schbath S, Martin V, Zytnicki M, Fayolle J, Loux V, Gibrat JF: Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis. Journal of computational biology : a journal of computational molecular cell biology. 2012, 19 (6): 796-813. 10.1089/cmb.2012.0022.

  20. Sio CP, Lu YL, Chen CM, Pai TW, Chang HT: Mining Polymorphic SSRs from Individual Genome Sequences. Complex, Intelligent, and Software Intensive Systems (CISIS), 2013 Seventh International Conference on: 3-5 July. 2013, 570-575.

  21. Ranum LP, Day JW: Dominantly inherited, non-coding microsatellite expansion disorders. Current opinion in genetics & development. 2002, 12 (3): 266-271. 10.1016/S0959-437X(02)00297-6.

  22. Kanazawa I: Molecular pathology of dentatorubral-pallidoluysian atrophy. Philosophical transactions of the Royal Society of London Series B, Biological sciences. 1999, 354 (1386): 1069-1074. 10.1098/rstb.1999.0460.

  23. Tidow N, Boecker A, Schmidt H, Agelopoulos K, Boecker W, Buerger H, Brandt B: Distinct amplification of an untranslated regulatory sequence in the egfr gene contributes to early steps in breast cancer development. Cancer research. 2003, 63 (6): 1172-1178.

  24. Yu MW, Yang YC, Yang SY, Cheng SW, Liaw YF, Lin SM, Chen CJ: Hormonal markers and hepatitis B virus-related hepatocellular carcinoma risk: a nested case-control study among men. Journal of the National Cancer Institute. 2001, 93 (21): 1644-1651. 10.1093/jnci/93.21.1644.

  25. Li JY, Popovic N, Brundin P: The use of the R6 transgenic mouse models of Huntington's disease in attempts to develop novel therapeutic strategies. NeuroRx : the journal of the American Society for Experimental NeuroTherapeutics. 2005, 2 (3): 447-464.

  26. Richards RI, Holman K, Yu S, Sutherland GR: Fragile X syndrome unstable element, p(CCG)n, and other simple tandem repeat sequences are binding sites for specific nuclear proteins. Human molecular genetics. 1993, 2 (9): 1429-1435. 10.1093/hmg/2.9.1429.

  27. Brais B, Bouchard JP, Xie YG, Rochefort DL, Chretien N, Tome FM, Lafreniere RG, Rommens JM, Uyama E, Nohira O, et al: Short GCG expansions in the PABP2 gene cause oculopharyngeal muscular dystrophy. Nature genetics. 1998, 18 (2): 164-167. 10.1038/ng0298-164.

  28. Matsuura T, Yamagata T, Burgess DL, Rasmussen A, Grewal RP, Watase K, Khajavi M, McCall AE, Davis CF, Zu L, et al: Large expansion of the ATTCT pentanucleotide repeat in spinocerebellar ataxia type 10. Nature genetics. 2000, 26 (2): 191-194. 10.1038/79911.

  29. Ohshima K, Montermini L, Wells RD, Pandolfo M: Inhibitory effects of expanded GAA.TTC triplet repeats from intron I of the Friedreich ataxia gene on transcription and replication in vivo. The Journal of biological chemistry. 1998, 273 (23): 14588-14595. 10.1074/jbc.273.23.14588.

  30. Sakamoto N, Ohshima K, Montermini L, Pandolfo M, Wells RD: Sticky DNA, a self-associated complex formed at long GAA*TTC repeats in intron 1 of the frataxin gene, inhibits transcription. The Journal of biological chemistry. 2001, 276 (29): 27171-27177. 10.1074/jbc.M101879200.

  31. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics. 2012, 44 (2): 226-232. 10.1038/ng.1028.

  32. Thompson JD, Gibson TJ, Higgins DG: Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics. 2002, Chapter 2 (Unit 2): 3-

  33. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.

  34. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.

  35. Kersey PJ, Allen JE, Christensen M, Davis P, Falin LJ, Grabmueller C, Hughes DS, Humphrey J, Kerhornou A, Khobova J, et al: Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic acids research. 2014, 42 (Database): D546-552.

  36. Zhang W, Chen J, Yang Y, Tang Y, Shang J, Shen B: A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PloS one. 2011, 6 (3): e17915. 10.1371/journal.pone.0017915.

  37. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC bioinformatics. 2009, 10: 421. 10.1186/1471-2105-10-421.

  38. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends in genetics : TIG. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.

Declarations

The publication charges of this article were funded by the Ministry of Science and Technology, Taiwan (MOST 103-2627-B-019 -003 and MOST 103-2221-E-019 -037 to Tun-Wen Pai).

This article has been published as part of BMC Genomics Volume 15 Supplement 10, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S10.

Author information

Corresponding author

Correspondence to Tun-Wen Pai.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CMC, CPS, and TWP conceived the algorithms. CMC, CPS, and YLL implemented the algorithms and performed the experiments. CMC and CPS wrote the manuscript. TWP, HTC, and CHH evaluated the systems, and proofread and revised the manuscript. All authors read and approved the final manuscript.

Chien-Ming Chen and Chi-Pong Sio contributed equally to this work.

Electronic supplementary material

12864_2014_6755_MOESM1_ESM.XLSX

Additional file 1: Supplementary Document I: A total of 477 polymorphic STR patterns retrieved from 492 human-unique genes. Sorted by Avg. Dev. and Max. Dev.; Chrom is an abbreviation of chromosome; Enbl represents Ensembl; NA12878 represents the CEU child, NA12891 the CEU father, and NA12892 the CEU mother; NA19238 represents the YRI mother, NA19239 the YRI father, and NA19240 the YRI child. Region: 1=Intron, 2=Upstream, 3=Downstream, 4=5'UTR, 5=Coding, 6=3'UTR, 7=Undefined exon. (XLSX 43 KB)

12864_2014_6755_MOESM2_ESM.XLSX

Additional file 2: Supplementary Document II: A total of 26 STRs selected from 477 polymorphic STRs based on specific conditions. All STRs were sorted by different levels of family (Type I), individual (Type II), and ethnic (Type III). Chrom is an abbreviation of chromosome; Enbl represents Ensembl; NA12878 represents the CEU child, NA12891 the CEU father, and NA12892 the CEU mother; NA19238 represents the YRI mother, NA19239 the YRI father, and NA19240 the YRI child. The symbol "*" in the GID column represents the genetic variation of a single nucleotide polymorphism (SNP). (XLSX 18 KB)

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chen, CM., Sio, CP., Lu, YL. et al. Identification of conserved and polymorphic STRs for personal genomes. BMC Genomics 15 (Suppl 10), S3 (2014). https://doi.org/10.1186/1471-2164-15-S10-S3

  • DOI: https://doi.org/10.1186/1471-2164-15-S10-S3

Keywords