Large synteny blocks revealed between Caenorhabditis elegans and Caenorhabditis briggsae genomes using OrthoCluster

Background: Accurate identification of synteny blocks is an important step in comparative genomics towards the understanding of genome architecture and expression. Most computer programs developed in the last decade for identifying synteny blocks have limitations. To address these limitations, we recently developed a robust program called OrthoCluster, and an online database, OrthoClusterDB. In this work, we demonstrate the application of OrthoCluster in identifying synteny blocks between the genomes of Caenorhabditis elegans and Caenorhabditis briggsae, two closely related hermaphrodite nematodes.
Results: Initial identification and analysis of synteny blocks using OrthoCluster enabled us to systematically improve the genome annotation of C. elegans and C. briggsae, identifying 52 potential novel genes in C. elegans, 582 in C. briggsae, and 949 novel orthologous relationships between these two species. Using the improved annotation, we have detected 3,058 perfect synteny blocks that contain no mismatches between C. elegans and C. briggsae. Among these synteny blocks, the majority are mapped to homologous chromosomes, as previously reported. The largest perfect synteny block contains 42 genes and spans 201.2 kb in Chromosome V of C. elegans. On average, perfect synteny blocks span 18.8 kb in length. When some mismatches (interruptions) are allowed, synteny blocks ("imperfect synteny blocks") that are much larger in size are identified. We have shown that the majority (80%) of the C. elegans and C. briggsae genomes are covered by imperfect synteny blocks. The largest imperfect synteny block spans 6.14 Mb in Chromosome X of C. elegans and there are 11 synteny blocks that are larger than 1 Mb in size. On average, imperfect synteny blocks span 63.6 kb in length, larger than previously reported.
Conclusions: We have demonstrated that OrthoCluster can be used to accurately identify synteny blocks and have found that synteny blocks between C. elegans and C. briggsae are almost three-fold larger than previously identified.


Background
The conservation of large-scale genomic sequences across two or more genomes (synteny blocks) is of primary interest because their identification sets the stage for identifying and characterizing sequence and functional differences among genomes [1]. The term synteny has been used in different contexts in the past. Originally, synteny was used to indicate the colocalization of different genes in corresponding chromosomes of different species (a.k.a. "chromosomal synteny") [2]. Recently, with the availability of thousands of sequenced genomes, synteny has been used to describe the conservation of co-localized genes in the same order within different genomes (a.k.a. "conserved segment"). On some occasions, the term "conserved synteny" has been used to refer to a genomic region in which the chromosomal location of multiple markers is conserved, but not necessarily their precise order [3]. The term "synteny block" [4] has been defined previously as a segment in one genome that can be converted, through genome rearrangements, into a conserved segment in another genome. As such, a synteny block does not necessarily represent areas of perfectly continuous similarity between genomes. In this paper, we use the term "perfect synteny block" to mean "a genomic region of perfectly conserved gene content, order and strandedness", as defined by Coghlan and Wolfe [5]. As an extension to this definition, we use "imperfect synteny block" to mean "a genomic region containing some level of interruption, and in which order and strandedness is not necessarily conserved" [6].
In the past decade, different methods have been proposed to identify synteny blocks [7][8][9][10][11][12]. However, these methods usually lack one or more of the following functionalities required for detailed analysis: (1) comparing more than two genomes; (2) allowing interruptions within synteny blocks; (3) capturing the strandedness of genes; and (4) addressing one-to-many orthologous relationships. Failure to provide these functionalities makes these programs inapplicable for the identification of genome rearrangement events such as inversions, insertions, reciprocal translocations and segmental duplications. To tackle these problems, we have recently developed a new method called OrthoCluster, a computer program for the systematic detection of synteny blocks between two or among multiple genomes [6]. Briefly, OrthoCluster takes as input genetic markers (such as genes and microsatellites) and their relationships (such as orthologous relationships) and scans through two or more genomes for synteny blocks. OrthoCluster distinguishes genetic markers as either in-map or out-map. A genetic marker in one genome is called in-map if it has orthologous genetic markers in all corresponding genomes. In contrast, a genetic marker in one genome is called out-map if it does not have orthologous genetic markers in corresponding genomes.
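The in-map/out-map distinction can be illustrated with a minimal Python sketch. It assumes a pairwise comparison and a plain dictionary of orthologs; the function name `classify_markers` and the data layout are hypothetical and are not part of OrthoCluster. The toy input uses the WS180 status of B0240.2 and B0240.4 described in the Results.

```python
from typing import Dict, Iterable, Set


def classify_markers(
    markers: Iterable[str],
    correspondence: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Label each marker 'in-map' if it has at least one ortholog in the
    other genome (pairwise case), otherwise 'out-map'."""
    return {
        marker: ("in-map" if correspondence.get(marker) else "out-map")
        for marker in markers
    }


# Toy input: in WS180, B0240.2 has a predicted ortholog (CBG23278),
# whereas B0240.4 has no clear ortholog in C. briggsae.
labels = classify_markers(
    ["B0240.2", "B0240.4"],
    {"B0240.2": {"CBG23278"}},
)
```

For more than two genomes, the in-map test would have to hold in every compared genome, which this pairwise sketch does not model.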
To facilitate the application of OrthoCluster, we have recently developed a web server called OrthoClusterDB [13]. Additionally, a book chapter describing its usage and application has been published [14]. In addition to its use in identifying synteny blocks, OrthoCluster can be applied to identify segmental duplications within a genome [15].
C. elegans is a free living soil-dwelling hermaphrodite nematode and a popular model organism for biomedical studies because of its small size, transparent body, short life cycle, ease of propagation and compact genome. C. elegans was also the first multicellular organism subject to whole genome sequencing [16], and the genome sequence of this species has been declared to be complete, with no remaining gaps in 2002. After more than a decade of annotation after its first publication, the genome of C. elegans is arguably the best annotated of a multicellular organism to date [17,18]. The sequencing of its sister species Caenorhabditis briggsae, also a hermaphrodite, sets up an excellent platform for comparative genomic analysis [5,19]. Recently, by applying OrthoCluster, we have identified segmental duplications in the nematode Caenorhabditis elegans genome, including a large duplication that is polymorphic among C. elegans laboratory N2 strains [15]. In this project, we applied OrthoCluster to identify synteny blocks between C. elegans and its sister species Caenorhabditis briggsae, whose genome was sequenced a few years ago [19].
Synteny block identification and characterization are critical for understanding genome structure and functional domains of genomes. Synteny between C. elegans and C. briggsae was first explored when the first sequenced reads of C. briggsae became available. Using their program WABA (for "Wobble Aware Bulk Alignment") [20], Kent and colleagues compared the whole genome sequence of C. elegans and 8 Mb of C. briggsae sequences (in 229 cosmids) and found that 59% of these genomes are homologous at the base level, while 41% of the genome sequences are found in nonalignable regions. Using these alignments, they estimated the synteny relationship between C. elegans and C. briggsae and found that ~40% of the genome is resistant to rearrangements. Later, using a gene-based approach, Coghlan and colleagues examined a slightly larger set of sequences (12.9 Mb of the C. briggsae genome) for synteny blocks and genome rearrangement events [5] and found many perfect synteny blocks. They also identified larger imperfect synteny blocks between these two genomes with an average size of 53 kb. The completion of the C. briggsae genome sequencing project enabled the C. briggsae genome analysis group to compare C. elegans and C. briggsae at the whole genome scale at the supercontig level [19]. To identify regions of colinearity, the program WABA [20] was used to produce base level alignments, followed by merging of adjacent blocks and bridging of small transpositions and inversions. Eventually, 4,837 alignments were obtained that cover 84.6% of the C. elegans genome, with a median length of 5.6 kb (mean = 37.5 kb) [19]. The average size is smaller than that obtained using gene-based analysis reported previously [5]. Recently, a chromosomal-level assembly of the C. briggsae genome [21] has been constructed, which can be utilized to facilitate synteny identification and analysis. 
Here, taking advantage of this new assembly and our newly developed program OrthoCluster, we revisit and reanalyze synteny blocks between these two genomes.

Results
Initial comparison between C. elegans and C. briggsae genomes
Using the C. elegans genome annotation in WormBase release WS180 [17], the genome assembly and annotation of C. briggsae [21] (from the same release), and the correspondence file generated using InParanoid [22], we detected 3,075 perfect synteny blocks between the genomes of C. elegans and C. briggsae using OrthoCluster. These blocks range in size from 2 to 28 genes (961 bp to 168.2 kb, Figure 1b). Examination of these synteny blocks, including the gene models contained within these blocks, immediately suggests that many gene models (primarily the C. briggsae ones) are defective, which leads to the unnecessary truncation of large synteny blocks. One example of such a case is shown in Figure 2, which illustrates two genomic regions in C. elegans and C. briggsae that are nearly perfectly conserved with the exception of one gene in C. elegans, B0240.4, which breaks the synteny. Based on the current WormBase annotation (WS180), this gene does not have a clear ortholog in C. briggsae. Examination of the alignment of genes B0240.4 and B0240.2 in C. elegans and gene CBG23278 in C. briggsae (which is the predicted ortholog of B0240.2) suggests that the predicted C. briggsae gene is defective. Indeed, the current gene model of CBG23278 can be split into two separate genes, one orthologous to B0240.4 and the other orthologous to B0240.2. Experimental validation based on PCR reactions that prove the existence of the two genes and the non-existence of the junction on a cDNA library for C. elegans suggests that these are two separate genes (data not shown). Fixing cases like this will uncover many more bona fide orthologous relationships between C. elegans and C. briggsae.

Synteny-based gene model correction and ortholog assignment
We developed a procedure (described in detail in Methods) in order to detect and correct defective gene models at the whole genome scale. Altogether, we identified 52 putative new genes in C. elegans (Table 1, Additional file 1). In contrast, in C. briggsae, we have generated 582 revised gene models, 191 of which correspond to novel gene structures in previously defined intronic or intergenic regions (Table 1, Additional file 2). Most deletions and additions were due to gene splits and gene merges (Figure 3). We assigned new orthologous relationships based on sequence similarity revealed by the improved gene annotation and synteny, which leads to the assignment of 949 new orthologous relationships (Table 2).
Figure 1 Perfect synteny blocks in the C. elegans genome. a) Size distribution for perfect synteny blocks obtained using the improved annotation. b) Size distribution for perfect synteny blocks obtained using the WS180 annotation. c) The largest perfect synteny block between C. elegans and C. briggsae obtained using the improved annotation.
Vergara and Chen BMC Genomics 2010, 11:516 http://www.biomedcentral.com/1471-2164/11/516

Genome-wide identification and analysis of synteny blocks
Orthologous relationships
Based on the improved orthologous relationships (see Methods), the majority of the orthologous relationships between C. elegans and C. briggsae are one-to-one relationships (Table 3), with only 7.9% of the C. elegans genes with orthologous relationships (or 5.8% of the total genes in the improved annotation of C. elegans) having more than one ortholog in C. briggsae, ranging from 2 to 147 orthologs. Likewise, 8.3% of the C. briggsae genes with orthologous relationships (or 6.2% of the total genes in the improved annotation of C. briggsae) have more than one ortholog in C. elegans, ranging from 2 to 24 orthologs. One-to-one orthologous relationships exist mainly between homologous chromosomes of C. elegans and C. briggsae (Table 3), demonstrating strong chromosomal synteny, in good agreement with previous studies [21].

Perfect synteny blocks
Using OrthoCluster and the improved genome annotations, we identified 3,058 perfect synteny blocks (each synteny block contains at least two genes and no mismatches). Of these blocks, 2,687 are non-nested, whereas 371 are nested within larger synteny blocks. A nested synteny block corresponds to a subset of genes within a larger synteny block that is found duplicated in different genomic regions in either the same or different chromosomes. The largest perfect synteny block between the genomes of C. elegans and C. briggsae contains 42 genes (Figure 1a, Figure 4) and spans a 201.2 kb genomic segment in Chromosome V of C. elegans, corresponding to a 202.5 kb segment in Chromosome V of C. briggsae (Figure 1c). The mean size of these perfect synteny blocks is 18.8 kb, while the median size is 12.7 kb. Altogether, the perfect synteny blocks cover 11,058 genes in C. elegans (51.3 Mb, or 51.1% of the C. elegans genomic sequence) and 10,879 genes in C. briggsae (49.5 Mb, or 45.6% of the C. briggsae genomic sequence) (Table 4). A genome-wide view of synteny blocks can be generated using OrthoClusterDB [13] (Additional file 3, Figure S1). Most [...] [19,21], suggesting that Chromosome X is subject to fewer rearrangement events. Alternatively, most rearrangements occurring in Chromosome X are lethal and are therefore not preserved in evolution.
                                              C. elegans   C. briggsae
Genes deleted because of special cases        0            7
Predictions added because of special cases    0            5
Final number of genes                         20,192       19,717
Taking the definition of centers and arms provided by Hillier and colleagues [21], we find that, within autosomes, the median length of perfect synteny blocks in autosomal centers is 11.6 kb (mean = 16.6 kb), whereas the median length of perfect synteny blocks in autosomal arms is 12.2 kb (mean = 16.9 kb). This difference is not statistically significant (p-value = 0.15, Mann-Whitney Test). Among all six chromosomes, the one with the highest genomic coverage is Chromosome X (65.4%). Chromosome V, which is the largest chromosome in C. elegans, also contains the largest number of blocks (22.6%). Species-specific gene family expansions/contractions were observed previously and many gene family members have been found to form tandem clusters in C. elegans and C. briggsae [19,23], which is consistent with our recent observation that the C. elegans genome harbors a large number of intrachromosomal duplications, many of which occur in tandem [15]. In this project, we have demonstrated that members of the same gene family can form tandem clusters within synteny blocks identified using OrthoCluster. We found 534 such cases, of which 424 contain more genes in C. elegans while 110 have more genes in C. briggsae within these tandem gene clusters. One example of this is a syntenic region that contains more members of the GST (glutathione-S-transferase) family of genes in C. elegans than in C. briggsae (Additional file 4, Figure S2). Further exploration of these regions is required to unveil the mechanisms underlying the expansion/contraction of these genes.
Our gene model improvement has greatly enhanced our ability to identify larger synteny blocks. When we used the WS180 annotation (before gene model improvement) for the detection of perfect synteny blocks, we found more (3,075) but smaller blocks (Figure 1a, b; Additional file 5, Figure S3; Additional file 6) compared to those described above. For example, the largest synteny block contains 42 genes using the improved annotation, but only 28 genes if we use the WS180 annotation. In fact, the 28 genes are a subset of the synteny block composed of 42 genes detected using the improved annotation. Compared to the WS180 annotation, the improved annotations increase the coverage of the chromosomes (Additional file 6).
Numbers in parentheses represent relationships found in the "_random" assembly of the chromosome, as reported by Hillier and colleagues [21].

Contribution of operons to perfect synteny blocks
According to WormBase annotation (release WS180), there are 1,120 operons in C. elegans, ranging in size from two to eight genes (Table 4). Previous comparative studies have concluded that these operons are highly conserved between C. elegans and its sister species C. briggsae, with the vast majority of the operons (96% [19] and 93.2% [24]) conserved between these two species. What is the contribution of operons to the perfect synteny blocks identified between these two species? In order to address this question, we have examined the contribution of operons to perfectly conserved synteny blocks (Table 4, Figure 4). Our analysis suggests that operons constitute an insignificant part of the perfect synteny blocks. First, the portion of the C. elegans genome covered by the 1,120 annotated operons (9.8%) is dramatically smaller than that covered by the 3,058 perfect synteny blocks identified in this study (as shown above, 51.1% genomic coverage). More recent studies have shown that operons are not as conserved as previously reported and that there is a greater turnover of operon composition among Caenorhabditis species [25,26], suggesting that the contribution of operons to the perfect synteny blocks between C. elegans and C. briggsae is even lower.
Second, if we define an operonic synteny block as a perfect synteny block in which at least half of the genes belong to conserved operons, we find 385 such operonic synteny blocks (Figure 4). These operonic synteny blocks contain 498 operons (or 44.5% of the total operons). These 385 operonic synteny blocks cover only 7.4% of the C. elegans genome, still much smaller than the 51.1% of the C. elegans genome covered by all perfect synteny blocks.
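The operonic-block criterion can be written as a short Python sketch. It assumes the criterion means that at least half of a block's genes are members of conserved operons; the function name `is_operonic_block` and the set-based input are illustrative assumptions, not the code used in this study.

```python
from typing import Iterable, Set


def is_operonic_block(
    block_genes: Iterable[str],
    conserved_operon_genes: Set[str],
) -> bool:
    """Return True when at least half of the genes in a perfect synteny
    block belong to conserved operons (assumed reading of the criterion)."""
    genes = list(block_genes)
    if not genes:
        return False
    in_operon = sum(1 for gene in genes if gene in conserved_operon_genes)
    # Integer comparison avoids floating-point issues at exactly 50%.
    return 2 * in_operon >= len(genes)
```

With this reading, a four-gene block with two operon genes qualifies, while a three-gene block with one operon gene does not.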
Third, the limited contribution of operons to the observed synteny is further illustrated by the low coverage of the X Chromosome by operons (2.1%, 57 operons) in C. elegans, which is the chromosome that is most covered by perfect synteny blocks (65.4%, 431 perfect synteny blocks) between C. elegans and C. briggsae (Table 4).

Imperfect synteny blocks
During evolution, genome sequences are often interrupted by small genome rearrangement events such as insertions, deletions, inversions and duplications. It has been suggested that small inversions and transpositions can be regarded as noise in genome rearrangements [27]. Identification of imperfect synteny blocks is valuable because they provide a global view of the existing synteny between different species for regions that have been subject to various types of rearrangement events. To detect such synteny blocks, we ran OrthoCluster by allowing mismatches (see Methods) as well as by relaxing the constraints of order and strandedness of the genes within the blocks. In general, relaxing the constraints regarding gene order, strandedness and mismatches generates larger and fewer synteny blocks when compared to the perfect synteny blocks. In contrast to relaxing the number of mismatches, relaxing the constraints of order and strandedness within blocks alone has only a weak impact on block size distribution, suggesting that insertions/deletions and long-range transposition events are much more common than inversion and short-range transposition events. One example of a larger synteny block found when relaxing only order and strandedness constraints is one with 9 genes in Chromosome III of the C. elegans genome (Figure 5). This synteny block was split into two smaller ones when OrthoCluster was applied for detecting perfect synteny blocks. These two blocks, one of size 5 and the other 3, are separated by one in-map gene (F54G8.1) whose ortholog (CBG50416) is inverted with respect to the neighboring genes, hence disrupting the perfect conservation of strandedness.
Allowing either in-map or out-map mismatches leads to the identification of larger synteny blocks because neighboring perfect synteny blocks start to merge. For example, using the improved annotation, when the percentages of both the in-map and the out-map mismatches are set to 5%, the largest block contains 71 genes (Figure 6a and 6b) (mean = 20.2 kb, median = 12.4 kb), compared to 42 genes identified as the largest block when no mismatches are allowed (Figure 1a; Figure 4). When these mismatch percentages are increased to 10% and 20%, the largest block contains 209 genes (mean = 26.7 kb, median = 12.0 kb) and 838 genes (mean = 45.1 kb, median = 14.1 kb), respectively. When we ran OrthoCluster by allowing a maximum of 50% in-map mismatch and 50% out-map mismatch within each synteny block, we found 80.8% of the genomic sequence of C. elegans being syntenic to 78.3% of the C. briggsae genomic sequence. As illustrated in Figure 6c, allowing more mismatches leads to merging of unrelated blocks because the genomic coverage increases sharply for mismatch percentages above this point. Also, for values larger than 50%, the number of synteny blocks decreases dramatically, mostly due to the inclusion of in-map mismatches from unrelated regions of the genome (Additional file 7, Figure S4). At the 50% setting, the median length of the synteny blocks is 15.6 kb (mean = 63.6 kb) (Figure 7).
Again, the imperfect synteny blocks are not evenly distributed in the genomes. The mean size of imperfect synteny blocks is 53.6 kb (median = 15.7 kb) for autosomal synteny blocks, and 217.6 kb (median = 13.8 kb) for Chromosome X. This extremely large mean for the X chromosome compared to its median reflects that the size distribution of synteny blocks in the X chromosome is positively skewed (i.e., there are a few very large synteny blocks). Within autosomes, again we do not observe a significant difference between centers and arms (p-value = 0.42, Mann-Whitney Test), with the median length of autosomal centers being 15.3 kb (mean = 62.1 kb), whereas the median length of autosomal arms is 16.6 kb (mean = 45.4 kb). This is in agreement with a previous report [5]. The largest synteny block spans 6.14 Mb on Chromosome X of C. elegans, between 1.68 Mb and 7.82 Mb. Altogether, there are 11 synteny blocks that are larger than 1 Mb between these two genomes. They are distributed across all chromosomes of C. elegans except Chromosomes I and III. These 11 largest synteny blocks add up to 26 Mb. These large synteny blocks are unlikely to be found by chance under a random breakage model, even after correcting for multiple testing (data not shown) [28]. There are altogether 161 synteny blocks that are larger than 100 kb, which add up to 66 Mb in size, strongly suggesting that C. elegans and C. briggsae genomes share large synteny blocks. As shown in Figure 7, synteny blocks identified here are significantly larger than those identified using an alignment-based approach [19].
Figure 5 Interruption of synteny block by disruption of strandedness. This region in Chromosome III of the C. elegans genome contains nine genes, all of which have one-to-one relationships to their orthologous genes in a syntenic region in C. briggsae. The perfect synteny is disrupted by one gene, F54G8.1, whose ortholog is inverted in C. briggsae. The two adjacent perfect synteny blocks are merged into one large synteny block when we allow strandedness of genes to vary.
Numbers in parentheses correspond to the number of synteny blocks found in the "_random" assembly of the chromosome, as reported by Hillier and colleagues [21].

Discussion
In this work we applied our newly developed tool, OrthoCluster, for the detection of synteny blocks between the genome of C. elegans and the newly reconstructed C. briggsae genome. This anchor-based program has a number of features that make it useful for identifying synteny blocks. In addition to identifying mismatches within syntenic regions, it takes into consideration one-to-many orthologous relationships when identifying synteny blocks. It is also sensitive to gene strandedness. More importantly, OrthoCluster works with multiple genomes so that users can explore synteny among the expanding number of sequenced genomes. Now that the genomes of three additional Caenorhabditis species (C. remanei, C. japonica, and C. brenneri) have been sequenced, we are eager to apply OrthoCluster to identify and analyze synteny relationships among these genomes. The appropriate handling of these types of features enables users to detect genome rearrangement events such as insertions, deletions, duplications, inversions, and reciprocal translocations. Furthermore, OrthoCluster can be used for the detection of segmental duplications within a single genome [15]. Since OrthoCluster is an anchor-based program, correct annotation of the coordinates of the genetic markers used as anchors is an essential condition for the accurate estimation of synteny. Taken together, OrthoCluster is a flexible tool for the detection of synteny blocks among species of different evolutionary distances.
Figure 6 Imperfect synteny blocks in C. elegans. a) Synteny blocks generated by allowing a maximal percentage of in-map and out-map mismatches of 5%. b) The largest imperfect synteny block between C. elegans (containing 71 genes) and C. briggsae (containing 68 genes). c) C. elegans genomic coverage of syntenic blocks as a function of both in-map and out-map mismatches.
Figure 7 Size distribution of synteny blocks between C. elegans and C. briggsae. The red curve represents synteny blocks identified using OrthoCluster (ip = 50%; op = 50%), while the black curve represents synteny blocks reported previously [19].
We have demonstrated that syntenic information is useful for the improvement of defective gene models and detection of potential new genes and missing orthologous relationships. In this attempt, we have identified 582 new gene models (Table 1) in C. briggsae and 52 candidate new gene models in C. elegans. These improved annotations enabled us to identify 949 new orthologous relationships. Some of the new gene models that we have identified were independently detected by WormBase curators. For example, gene C10A4.10 was absent in WormBase release WS180, but was later curated and released in WS190. This gene was detected also with our procedure (Additional file 8, Figure S5).
The improved genome annotations and orthologous relationships have helped the synteny block analysis, since larger synteny blocks are found than those obtained with the WS180 annotation (Figure 1). Also, some conserved operon structures are restored with the improved annotations (Additional file 9, Figure S6). This methodology will be applied for improving the annotation of the newly sequenced genomes of C. remanei, C. brenneri, and C. japonica.
Hillier and colleagues constructed the first chromosomal level assembly of C. briggsae [21]. Taking advantage of OrthoCluster and this newly constructed C. briggsae assembly, we found that 80.8% of the C. elegans genome (and correspondingly 78.3% of the C. briggsae genome) is covered by synteny blocks that contain at least two genes. The amount of genome coverage by synteny blocks is consistent with a previous report [19]. Including "synteny blocks" composed of a single gene (in-map genes) only slightly increases the coverage of the C. elegans genome to 84.4% (corresponding to 81.9% of the C. briggsae). This coverage is also in excellent agreement with the work of Stein and colleagues (84.6% for C. elegans and 80.8% for C. briggsae) [19]. Thus, the conservation observed between the C. elegans and C. briggsae genomes is accounted for largely by synteny blocks that contain two or more genes. However, the synteny blocks discovered between C. elegans and C. briggsae using OrthoCluster (median size of 15.6 kb, average size of 63.6 kb) are much larger than those identified by the previous whole genome analysis (median size of 5.6 kb, average size of 37.5 kb).

Conclusions
Taken together, we have demonstrated that OrthoCluster can be used to accurately identify synteny blocks. Additionally, we have found that synteny blocks between C. elegans and C. briggsae are almost three-fold larger than previously identified.

OrthoCluster
The OrthoCluster algorithm and its development were described previously [6]. Briefly, it uses an anchor-based approach to effectively search for synteny blocks between two or more genomes given parameters for controlling synteny block size, mismatches within synteny blocks as well as preservation of order and strandedness (Additional file 10, Figure S7). Since OrthoCluster takes into consideration both order and strandedness of genes, it is useful for the detection of inversions and other genome rearrangement events. In addition to identifying perfect synteny blocks (that contain no mismatches and preserve gene order and strandedness), it can be applied to identify imperfect synteny blocks with various levels of mismatches. OrthoCluster needs two types of input files (Additional file 11, Figure S8): a genome file and a correspondence file. A genome file contains genetic markers (which could be annotated genes) with information regarding chromosome/supercontig names, start and end positions, as well as the strand in which each genetic marker resides. A correspondence file provides orthologous relationships between two (for pair-wise analysis) or more genomes (for multiple-genome analysis). Genetic markers that are not included in the correspondence file are called out-map genetic markers (in this paper, "genes" and "genetic markers" are used interchangeably). In contrast, genetic markers that are part of the correspondence file are called in-map genetic markers. A synteny block can be non-nested or nested (Additional file 12, Figure S9), with a nested block defined as one that is contained within a larger block. A nested synteny block results from a segmental duplication of a portion of a larger synteny block in one genome (Additional file 12, Figure S9d).
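As an illustration of the information carried by the two input files and of what "perfect" means, the following Python sketch models a genome-file record and tests whether two runs of consecutive genes form a perfect synteny block. It is not OrthoCluster's algorithm or file format: the `Marker` fields, the one-to-one ortholog dictionary, and the decision to accept a run that is inverted as a whole (reversed order with every strand flipped) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Marker:
    """One genome-file record (hypothetical field names)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def is_perfect_block(
    genes_a: List[Marker],
    genes_b: List[Marker],
    ortholog: Dict[str, str],
) -> bool:
    """genes_a and genes_b are consecutive markers sorted by start position.
    ortholog maps names in genome A to names in genome B (one-to-one here).
    Returns True if gene content, order and strandedness are all conserved,
    either directly or as a whole-run inversion (assumption)."""
    if len(genes_a) < 2 or len(genes_a) != len(genes_b):
        return False

    def matches(b_seq: List[Marker], flipped: bool) -> bool:
        for a, b in zip(genes_a, b_seq):
            if ortholog.get(a.name) != b.name:
                return False  # content or order differs
            if (a.strand == b.strand) == flipped:
                return False  # strandedness not conserved
        return True

    return matches(genes_b, False) or matches(genes_b[::-1], True)
```

Under these assumptions, a single ortholog on the opposite strand inside an otherwise colinear run makes the check fail, which mirrors the F54G8.1 case shown in Figure 5.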

Data Sources
Genome annotations of C. elegans and C. briggsae were obtained from WormBase (http://www.wormbase.org/) [17], release WS180. Since some genes produce multiple alternative isoforms and all of these isoforms represent one gene (locus), we used the longest isoform to represent each gene.
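Selecting one representative isoform per locus can be sketched as follows. This is a minimal illustration: the tuple layout, the function name `longest_isoform_per_gene`, and the tie-breaking rule (the first isoform seen wins) are assumptions, and the text does not state whether length was measured on the transcript, the coding sequence, or the protein.

```python
from typing import Dict, Iterable, Tuple


def longest_isoform_per_gene(
    isoforms: Iterable[Tuple[str, str, int]],
) -> Dict[str, str]:
    """isoforms yields (gene_id, isoform_id, length) tuples.
    Returns a mapping gene_id -> id of its longest isoform."""
    best: Dict[str, Tuple[str, int]] = {}
    for gene_id, isoform_id, length in isoforms:
        # Keep the current choice on ties, replace it only if strictly longer.
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (isoform_id, length)
    return {gene_id: isoform_id for gene_id, (isoform_id, _) in best.items()}
```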

Correspondence file preparation
To generate the correspondence file required by OrthoCluster, we assigned orthologous relationships between different genomes using InParanoid [22,29] with default settings. InParanoid has been evaluated to be one of the best performing methods for orthology detection [29]. Ortholog assignment between C. elegans and C. briggsae is further improved based on gene model improvement, sequence similarity, and synteny when applying our gene model improvement procedure. A correspondence file contains both one-to-one and one-to-many relationships.

Synteny based gene model improvement and ortholog assignment
As illustrated in Figure 8, we first identified imperfect synteny blocks that contain out-map mismatch genes using OrthoCluster. Out-map mismatches, which usually indicate genome-specific genes, can also indicate these two alternative possibilities: (1) the ortholog gene in the other genome has not been found, and (2) the corresponding gene model is defective in a way the orthologous relationship can't be established by orthology detection programs. Synteny information helps narrow down genomic regions that contain these missing or defective orthologous genes and improve defective gene models. Once we identified mismatches in synteny blocks, we attempted to identify missing/defective gene models using the homology-based gene prediction method GeneWise [30,31]. When we ran OrthoCluster by allowing up to 20 out-map mismatches per synteny block, we found 2,650 imperfect synteny blocks, 2,389 of which are non-nested blocks and 261 are nested ones. Of the 1,886 out-map mismatch genes within synteny blocks in the C. elegans genome, 695 C. elegans genes generated GeneWise predictions in C. briggsae that satisfy the filtration criteria described below (Additional file 13). We only consider predictions that cover at least 60% of the length of the query proteins with no internal stop codons. We identified 771 GeneWise predictions in C. briggsae genome. Note that some out-map mismatch genes generate more than one valid prediction (paralogs) within the corresponding synteny block. Applying the same strategy, we identified 702 GeneWise predictions in C. elegans. Depending on which location of the synteny block the prediction hits, each of the predictions can be categorized accordingly. There are two possibilities. First, the predicted gene overlaps with an intergenic or intronic region. In this case, we take the predicted gene as a new candidate gene. 
Second, the predicted gene overlaps with one or more existing genes within the corresponding synteny block (Additional file 14, Figure S10).
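The filtration criteria applied to GeneWise predictions (at least 60% of the query protein length and no internal stop codons) can be expressed as a small Python check. It is only a sketch: it assumes the predicted protein is available as a string in which stop codons are written as "*", and it measures coverage as the ratio of predicted to query protein length. The function name `passes_filter` is hypothetical, and no GeneWise output parsing is shown.

```python
def passes_filter(
    predicted_protein: str,
    query_length: int,
    min_coverage: float = 0.60,
) -> bool:
    """Accept a prediction that is at least min_coverage as long as the
    query protein and has no internal stop codon ('*' is assumed to mark
    a stop in the translated prediction)."""
    if query_length <= 0:
        return False
    body = predicted_protein.rstrip("*")  # a terminal stop is not internal
    if "*" in body:
        return False
    return len(body) / query_length >= min_coverage
```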
We also assigned new orthologous relationships using synteny information and similarity (blast alignment scores). To achieve this, we compared the out-map genes with the new gene models and calculated their percentage identity (PID). We accepted a new pair of orthologs if the PID between them was greater than or equal to 40% and the e-value was less than or equal to 1e-10. The revised orthologous relationships were then incorporated into the InParanoid-derived orthologous relationships.
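The acceptance rule for new ortholog pairs reduces to two threshold tests, sketched below in Python. The thresholds come from the text; the function name `accept_ortholog_pair` and its signature are illustrative, and the PID and e-value are assumed to have been obtained already from a blast alignment.

```python
def accept_ortholog_pair(
    pid: float,
    evalue: float,
    min_pid: float = 40.0,
    max_evalue: float = 1e-10,
) -> bool:
    """Accept a candidate ortholog pair when percentage identity is at
    least min_pid and the e-value is at most max_evalue."""
    return pid >= min_pid and evalue <= max_evalue
```

Both conditions must hold, so a highly similar pair with a weak e-value is rejected just as a significant but low-identity pair is.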

Additional material
Additional file 1: new gene models for C. elegans. gff3 file with the structure of all new genes in C. elegans.
Additional file 2: new genome annotation for C. briggsae. gff3 file with the structure of all genes in the new genome annotation for C. briggsae. New genes start with ID CBG5XXXX.
Additional file 3: Figure S1 genome view of the perfect synteny blocks between C. elegans and C. briggsae. Each chromosome in C. elegans has a distinctive color. The corresponding synteny blocks in C. briggsae can be mapped to the reference chromosome according to the color. This image was created using OrthoClusterDB http://genome.sfu.ca/orthoclusterdb/.
Figure 8 Synteny-based gene model improvement procedure. First, out-map mismatches are identified in the synteny blocks. Second, GeneWise is run to identify candidate genes using out-map mismatches as queries and the corresponding syntenic region as target. Third, predicted genes are examined and compared with other genes in the synteny blocks (proteins encoded by the predicted genes are at least 60% as long as their corresponding query proteins).