Compacta: a fast contig clustering tool for de novo assembled transcriptomes

Background: RNA-Seq is the preferred method to explore transcriptomes and to estimate differential gene expression. When an organism has a well-characterized and annotated genome, reads obtained from RNA-Seq experiments can be directly mapped to that genome to estimate the number of transcripts present and relative expression levels of these transcripts. However, for unknown genomes, de novo assembly of RNA-Seq reads must be performed to generate a set of contigs that represents the transcriptome. These contig sets contain multiple transcripts, including immature mRNAs, spliced transcripts and allele variants, as well as products of close paralogs or gene families that can be difficult to distinguish. Thus, tools are needed to select a set of less redundant contigs to represent the transcriptome for downstream analyses. Here we describe the development of Compacta to produce contig sets from de novo assemblies.

Results: Compacta is a fast and flexible computational tool that allows selection of a representative set of contigs from de novo assemblies. Using a graph-based algorithm, Compacta groups contigs into clusters based on the proportion of shared reads. The user can determine the minimum coverage of the contigs to be clustered, as well as a threshold for the proportion of shared reads in the clustered contigs, thus providing a dynamic range of transcriptome compression that can be adapted according to experimental aims. We compared the performance of Compacta against state of the art clustering algorithms on assemblies from Arabidopsis, mouse and mango, and found that Compacta yielded more rapid results and had competitive precision and recall ratios. We describe and demonstrate a pipeline to tailor Compacta parameters to specific experimental aims.

Conclusions: Compacta is a fast and flexible algorithm for the determination of optimum contig sets that represent the transcriptome for downstream analyses.

Figure 1. Contigs to be clustered by Compacta. Black lines represent contigs c1, c2 and c3. Colored lines represent reads r1 to r11, and the plot shows how each read maps to the contigs. Blue lines are reads unique to each contig, red lines are reads shared by contigs c1 and c2, while the orange line is a read shared by contigs c2 and c3.
Before proceeding, it is worth noting that the information present in the BAM file, and graphically summarized in Figure 1, will be reduced by ignoring the exact position of each read within each contig, considering contigs only as subsets of all the reads, which form the 'universal set' of the problem, say Ω = {r1, r2, ..., r11}. Thus, Compacta will consider contigs as subsets of reads, say c1 = {r1, r2, r3, r4, r5, r6}, c2 = {r4, r5, r7, r8, r9} and c3 = {r9, r10, r11}, while the information about contig and read lengths will be stored separately.
We modeled using unpaired reads of 150 bp, and Figure 1 shows how each read maps to the contigs. Blue lines (reads r1, r2, r3, r6, r7, r8, r10 and r11) are reads that map to a single contig: r1, r2, r3 and r6 to contig c1, reads r7 and r8 to contig c2, and reads r10 and r11 to contig c3. On the other hand, reads r4 and r5 (red lines) map to both contigs c1 and c2, while read r9 (orange line) is shared by contigs c2 and c3, i.e., this read maps to both contigs. Table 1 presents the characteristics of the contigs that Compacta takes into account for the step 'Filtering of low evidence contigs' of the algorithm.

Table 1. Length, number of reads and effective coverage for the contigs presented in Figure 1.

Contig | Length (bp) | Reads | Effective coverage
c1     | 375         | 6     | 2.4
c2     | 400         | 5     | 1.875
c3     | 225         | 3     | 2

In contrast with other programs for contig clustering, such as Corset (Davidson and Oshlack, 2014) or Grouper (Malik et al., 2018), Compacta does not simply set a uniform threshold on the number of reads per contig for them to be clustered, but takes into account the effective contig coverage, which is the sum of the lengths of all reads grouped in the contig divided by the contig length. Given that we used reads of 150 bp, the effective coverage of contigs c1, c2 and c3 is 2.4, 1.875 and 2, respectively.
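The effective-coverage computation and the low-evidence filter just described can be sketched as follows. This is an illustrative reimplementation in Python, not Compacta's actual C++ code; the contig lengths are back-calculated from the coverages stated above.

```python
# Sketch of the effective-coverage filter ('Filtering of low evidence
# contigs'). Names and lengths are illustrative, taken from the
# Figure 1 example; this is not Compacta's internal implementation.

READ_LEN = 150  # unpaired reads of 150 bp, as in the example

# contig -> (length in bp, set of reads mapping to it), from Figure 1
contigs = {
    "c1": (375, {"r1", "r2", "r3", "r4", "r5", "r6"}),
    "c2": (400, {"r4", "r5", "r7", "r8", "r9"}),
    "c3": (225, {"r9", "r10", "r11"}),
}

def effective_coverage(length, reads, read_len=READ_LEN):
    """Sum of the lengths of all reads grouped in the contig,
    divided by the contig length."""
    return len(reads) * read_len / length

def low_evidence(contigs, l=2.0):
    """Contigs below the minimum effective coverage threshold (-l);
    these are output as singletons and skip the clustering step."""
    return {name for name, (length, reads) in contigs.items()
            if effective_coverage(length, reads) < l}
```

With the default `-l 2` only c2 (coverage 1.875) is flagged, while with `-l 1.5` all three contigs pass the filter, matching the two scenarios discussed in the text.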
If this example is run in Compacta with the default parameter "-l 2", i.e., asking for a minimum effective coverage of l = 2, then contig c2 will not surpass this threshold (1.875 < 2); it will be marked as a 'low evidence contig' and output as a singleton, without entering the Compacta clustering algorithm. In what follows we assume that Compacta was run with parameters "-l 1.5 -d 0.3", and thus the three contigs in the example will pass the minimum effective coverage filter and enter the pre-clustering step (step 3) of the Compacta algorithm.
In Compacta the formula to compute the weight w_ij between contigs i and j is given by

w_ij = R_ij / min(R_i, R_j),

where R_i and R_j are the number of reads mapping to contigs i and j, respectively, and R_ij is the total number of reads that map to both contigs i and j. Table 2 shows the calculation of the weights, w_ij, between the set of three contigs previously shown in Figure 1. Table 2. Calculation of weights, w_ij, for all pairs of contigs in the graph.
Contig i | Contig j | w_ij
c1       | c2       | 2/min(6, 5) = 2/5 = 0.40
c2       | c3       | 1/min(5, 3) = 1/3 = 0.33
c1       | c3       | 0, given that R_1,3 = 0

To proceed with the treatment that Compacta will give to the input data, it is important to note that contigs c1, c2 and c3 form a connected subgraph, because reads connect the three contigs, say c1 ↔ c2 ↔ c3 (even when c1 is not directly connected with c3). Thus, in step 4 of the algorithm (pre-cluster detection), Compacta will detect that the contig set {c1, c2, c3} forms a pre-cluster, and will load the weights of all the contig pairs into an auto-sorted heap structure, in the decreasing weight order presented in Table 2.
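A minimal sketch of the weight formula, treating contigs as read sets exactly as in the Figure 1 example:

```python
# Weight between two contigs: w_ij = R_ij / min(R_i, R_j), where the
# contigs are represented as sets of read identifiers (Figure 1 data).
c1 = {"r1", "r2", "r3", "r4", "r5", "r6"}
c2 = {"r4", "r5", "r7", "r8", "r9"}
c3 = {"r9", "r10", "r11"}

def weight(a, b):
    """Proportion of shared reads relative to the smaller contig."""
    shared = len(a & b)  # R_ij: reads mapping to both contigs
    return shared / min(len(a), len(b))

print(round(weight(c1, c2), 2))  # 2/5 -> 0.4
print(round(weight(c2, c3), 2))  # 1/3 -> 0.33
print(weight(c1, c3))            # 0.0, since R_1,3 = 0
```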
The clustering step proceeds by calculating the weights between the elements currently present in the pre-cluster, say the new cluster C1 = c1 ∪ c2 and the original contig c3. That weight is w_C1,c3 = 1/min(9, 3) = 1/3 = 0.33. Given that this weight is larger than the threshold d = 0.3, a new cluster, say C2 = C1 ∪ c3 = {r1, r2, ..., r11}, is formed; this new cluster includes all reads present in the example and will be reported in the output. The following text shows the structure of the output files for this example when run with parameters -d 0.3 -l 1.5, where contig.1 = c1, contig.2 = c2 and contig.3 = c3, and the final result shown above, say C2, is re-named here as Cluster.1. In contrast, if the example is run with the default parameters, -d 0.3 -l 2, then the Compacta algorithm will not group any of the three original contigs into clusters, because c2 will not pass the threshold l = 2; thus the pre-cluster {c1, c2, c3} will not exist, and c2 will be reported as a 'singleton', while c1 and c3 will be reported as 'clusters with a single contig'.
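The merge loop just described can be sketched as a greedy agglomeration driven by a max-heap of pair weights. This is a simplified, hypothetical reimplementation for illustration only, not Compacta's C++ code (stale heap entries are skipped lazily instead of being updated in place):

```python
import heapq

def cluster(contigs, d):
    """Greedy merging of a pre-cluster: repeatedly pop the heaviest
    contig pair and merge it while its weight is at least d.
    contigs: dict name -> set of read identifiers."""
    clusters = {name: set(reads) for name, reads in contigs.items()}

    def weight(a, b):
        shared = len(clusters[a] & clusters[b])
        return shared / min(len(clusters[a]), len(clusters[b]))

    # max-heap simulated by negating the weights
    heap = [(-weight(a, b), a, b) for a in clusters for b in clusters if a < b]
    heapq.heapify(heap)
    while heap:
        w, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale entry: one member was already merged away
        if -w < d:
            break     # remaining pairs are all below the threshold
        merged = a + "+" + b
        clusters[merged] = clusters.pop(a) | clusters.pop(b)
        # push fresh weights between the new cluster and the survivors
        for other in list(clusters):
            if other != merged:
                heapq.heappush(heap, (-weight(merged, other), merged, other))
    return clusters
```

Running this on the three example contigs with d = 0.3 merges c1 with c2 (weight 0.40) and then the result with c3 (weight 0.33), yielding a single cluster of all 11 reads, as in the worked example; with d = 0.5 no pair passes the threshold and the three contigs remain separate.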

Comparing Compacta with other clustering algorithms
In this section we include additional tests performed to evaluate the Compacta, Corset and Grouper algorithms using sub-sets of data created from the Arabidopsis and mouse transcriptomes (Table 3), using the new Tuxedo pipeline (Pertea et al., 2016). First we aligned the reads to their respective genome using Hisat2 (Kim et al., 2015) with default parameters. The '--dta' option was used to allow compatibility with the StringTie assembler, and the '--known-splicesite-infile' option was used with the splicing sites extracted from the annotation file. The output files were sorted and converted to BAM format using Samtools 1.5 (Li et al., 2009). Then we used StringTie (Pertea et al., 2015) to estimate transcript abundances on the BAM alignment files, using parameter '-G' to include the annotation file, '-B' to output tables compatible with Ballgown and '-e' to limit the read estimation only to transcripts matching the reference. We used two Bioconductor (Huber et al., 2015) packages: Ballgown (Frazee et al., 2014) to load the abundance matrix created with StringTie, and Polyester (Frazee et al., 2015) to simulate two treatments with 30 million paired-end reads for each organism based on the abundances of the real datasets obtained with StringTie and Ballgown. Only samples ERR1592576 and ERR1592579, and SRR7248743 and SRR7248745, of the Arabidopsis and mouse datasets, respectively, were used to create the simulated datasets. Additionally, we created two extra datasets based on the abundances obtained from the real Arabidopsis data to test the effect of coverage on the clustering efficiency; each set includes two treatments with 10 and 50 million paired-end reads per sample, respectively.
Polyester output consists of FASTA files without quality scores or sequencing adapters to remove, so we skipped the quality control step and proceeded to de novo assembly using Trinity with default parameters, including the '--seqType fa' flag to specify the input format; the number of contigs produced by Trinity for each organism can be seen in Table 3. Simulated reads were mapped back to their respective transcriptome as paired-end using Hisat2 with parameters '-f -k 100 --score-min L,-0.1,-0.1 --no-spliced-alignment', obtaining at least an 88% mapping rate per sample. We used Salmon v0.12.0 (Patro et al., 2017) with options '--dumpEq --writeOrphanLinks' to pseudo-map clean reads against their respective transcriptome in order to obtain the equivalence classes needed to run Grouper; a mapping rate of at least 95% was obtained for each sample. Output files created by Hisat2 and Bowtie2 were converted to BAM format using Samtools 1.5 (Li et al., 2009). BAM files were processed with Corset 1.07 using default parameters, adjusting the group definition ('-g' option) according to the samples in each experiment, and we set parameter '-D 9999999999' to disable paralog detection. BAM files were then processed with Compacta 1.0 using default parameters, adjusting the '-n', '-g' and '-s' options to match the structure of each dataset. We clustered the de novo assembled transcriptomes using CD-HIT v4.7 (Li and Godzik, 2006) with parameters '-c 0.95 -n 8 -T 0 -g 1 -d 0'. To obtain the assembler cluster list we used the script get_Trinity_gene_to_trans_map.pl from Trinity on each of the assembled transcriptomes to create a list of contigs per cluster. Finally, clustering with Grouper was carried out using the equivalence class files produced by Salmon as input and default parameters with 'orphan: True mincut: True'.
Figure 2 shows the execution time in seconds for Compacta, Corset and Grouper on each of the simulated datasets; Trinity and CD-HIT were not included because their running times are not fairly comparable with those of the other clustering tools. In this figure each bar represents the running time of one program on one of the assemblies; the x-axis groups bars by program and colors identify the assembly of each organism. Compacta clusters both the Arabidopsis and the mouse simulated datasets faster than Corset, even though the difference is not as remarkable as with the real datasets. Comparing Compacta vs Grouper, we can see that Grouper is faster on the Arabidopsis simulated dataset. This can be explained by the fact that low-complexity experiments are easy to cluster for all three algorithms; however, both Compacta and Corset load BAM files, which consist of gigabytes of information that must be parsed before clustering, while Grouper loads equivalence class files, smaller plain-text files that are easily and quickly parsed. Still, on the mouse simulated dataset Compacta was the fastest while Grouper was the slowest of the three programs, just as with the real datasets. This shows that on tests with more complex data the time required to load and parse the input files becomes insignificant compared to the time required by the clustering algorithm, indicating that Compacta is more efficient and makes better use of computer resources than Corset and Grouper.
Just like with the real datasets (see Figure 1 in the main text), we calculated recall (R) and precision (P) for the five clustering algorithms, three of which are 'read-based' (Compacta, Corset and Grouper), plus CD-HIT, which is based on information from the contig sequences, and Trinity, which is based on de Bruijn graphs. All five algorithms were tested using the assemblies of the simulated datasets of Arabidopsis and mouse described previously; the five programs were run using default parameters and the results are shown in Figure 3.
In Figure 3 we can see that, as with the real datasets, CD-HIT was the program with the poorest recall and the highest precision. Again, this can be explained by the fact that CD-HIT uses information from the contig sequences, which allows it to group highly similar contigs and raises the precision. It is worth noting that precision is only calculated on the clustered contigs, so even though CD-HIT produces the smallest number of clusters it shows a high precision. This approach to clustering, however, leads CD-HIT to fail to group related contigs that vary at the sequence level; in fact, CD-HIT has trouble identifying alleles and genes with SNPs and INDELs, and it might produce different clusters for each one even though they should be together, which explains the low recall. Read-based algorithms such as Compacta, Corset and Grouper seem to be more robust against these problems, and the three of them show a more balanced precision and recall. In both tests Corset has the highest precision of the three programs, but also the lowest recall; Grouper competes with Compacta for the highest recall of the five algorithms, but suffers an unexplained drop of precision in the mouse test. Compacta, on the other hand, shows greater recall than almost all the other algorithms and a precision only slightly lower than that of programs like CD-HIT or Corset; in general, we think it offers a good balance of precision and recall.
We created datasets to evaluate the behavior of the three read-based algorithms (Compacta, Corset and Grouper) under different sequencing depths. With this aim we used three sets of simulated Arabidopsis data at different sequencing levels, namely 10, 30 and 50 million reads. After mapping the reads back to their respective transcriptome, clustering with the three algorithms was performed and precision and recall were estimated. Figure 4 presents the results of these tests.
In Figure 4 we can see that the three programs show a stable precision across the three assemblies with different sequencing levels; however, both Grouper and Corset show differences in recall across the three experiments, with an apparent decrease of recall as the sequencing level increases. In contrast, Compacta not only presents a stable precision in the three experiments, but also shows a slight increase of recall as the sequencing level increases. This implies that Compacta is not only faster than Corset and Grouper on complex datasets, giving a well-balanced precision-recall, but also that it might have a more stable behavior (precision and recall) as the number of reads increases. These results show that Compacta is a more robust algorithm than the ones implemented in Corset and Grouper.
We have previously shown how the parameter d can affect the number of clusters and the number of contigs per cluster produced by Compacta, but these are not the only characteristics affected by this parameter, as changing the cluster configuration might affect the performance of the clustering algorithm. In order to evaluate the effect of the cut-off value (parameter d) on the performance of Compacta, we used the simulated Arabidopsis dataset with 30 million reads and ran Compacta with default parameters using values of d from 0.1 to 1 in increments of 0.1; we then calculated precision and recall from the outputs produced by the algorithm. Results of these tests are shown in Figure 5.

Figure 4. Effect of sequencing depth on clustering performance. Three Arabidopsis realistic simulated datasets with 10, 30 and 50 million paired-end reads per sample, clustered using Compacta, Corset and Grouper, show that sequencing depth has an impact on clustering performance. Compacta's recall increases with the number of reads and remains stable after 30 million reads, while Corset's recall decreases as the number of reads per sample increases.
As we can see in Figure 5, the first test (d = 0.1) has the highest recall and the lowest precision, and as we increase the value of d the precision increases but some recall is lost in the process. Finally, with d = 1 both precision and, especially, recall drop, because the threshold is so strict that only contigs sharing many reads are clustered together, and some related contigs are left out without being clustered. We chose a value of d that produces balanced recall and precision; in this case we used d = 0.3, but other values such as 0.2 or 0.4 would be as good, depending on the user's preferences.

Figure 5. Effect of parameter d on precision and recall. Precision and recall of ten runs of Compacta on the Arabidopsis simulated dataset using different clustering threshold values (parameter d). When we choose a threshold value closer to zero the algorithm clusters more contigs, raising the recall but losing precision; when we choose a threshold value closer to one the algorithm becomes stricter about which contigs to cluster, which is reflected in a greater precision but leaves out many contigs, dropping the recall.

Compacta performance with a de novo assembled transcriptome
The Arabidopsis transcriptome employed here came from the experiment in (Liu et al., 2016). The full experiment produced approximately 36 GB of high-quality reads. For the purposes of Compacta evaluation these reads were assembled de novo using Trinity v2.4.0 (Grabherr et al., 2011) with default parameters. The assembly gave a total of 106,895 contigs with lengths varying from 201 to 18,663 bp, with mean and median sizes of 1,987 and 1,636 bp, respectively. The sum of the lengths of all contigs was ≈ 212.4 Mb, while the standard deviation of contig size was Ŝ ≈ 1,626.
The figures above give evidence of the problem that arises with de novo transcriptome assembly. We have 106,895 contigs; however, the true maximum number of sequences that can be produced by the Arabidopsis genome is 41,671, thus the 'redundancy' in the assembled transcriptome is at least 106895/41671 ≈ 2.6, and that is assuming that absolutely all 33,602 Arabidopsis genes were expressed in the experiment in all 41,671 possible variant forms. Of course, that assumption is highly unlikely, thus we can be sure that the redundancy of sequences in the assembly must be > 2.6.

In this section we will consider the number of clusters given by Compacta, say z, as a function only of the parameter -d, say z = f(d). The parameter d, which can vary from zero to one, gives a threshold to consider that two contigs (in the input) belong to the same cluster (in the output), and it controls the clustering extent. Thus, with a value of d = 0, all contigs that share one or more reads will be clustered, while with d = 1 only contigs that are subsets of one another will be reported as clusters.
Compacta was applied to the pre-processed files ('pre-clusters') of the transcriptome, which in turn were obtained from the original BAM files downloaded from the source of the transcriptome, see (Liu et al., 2016). Compacta was run with a grid for the parameter -d from 0 to 1 in 0.1 increments (11 values). Then, a finer grid was tried near the points where the slope of the function z = f(d) had a sudden change. Table 4 gives values of z for selected values of d, while Figure 1 in the main text shows the estimated function z = f(d) (as a percentage) for the transcriptome analyzed. In Table 4 we can see the approximate estimated slope of the z = f(d) curve at the d points (column '≈ Slope z = f(d)'). Near the 'critical' points d = 0.035 and d = 0.955 is where the estimated slope of the curve (90800 and 89800, respectively) is closest to the slope assuming linear behavior, b = 89492. Column '%max' in Table 4 shows the percentage of z with reference to the maximum number of clusters that can be obtained with Compacta, i.e., the percentage of z with reference to 103,262 (obtained with d = 1). At d = 0 the value of %max is 13.34, indicating that it is not possible to select a smaller percentage of contigs to represent the full collection. On the other hand, the last column in Table 4, z/min(z), shows the level of 'compression' achieved by the algorithm with different values of d; for example, with d = 1 we have 7.5 times more clusters than the minimum number obtained with d = 0.
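The grid-refinement idea described above can be sketched as follows: estimate the slope of z = f(d) by finite differences on the coarse grid and pick the grid points with the largest change in slope as candidates for a finer grid. The helper below is illustrative (the (d, z) pairs would come from repeated Compacta runs); it is not part of Compacta itself.

```python
# Given coarse-grid values of d and the corresponding number of
# clusters z = f(d), rank interior grid points by how abruptly the
# finite-difference slope changes around them. These points are the
# candidates around which a finer grid of -d values would be tried.

def refine_candidates(ds, zs):
    """Return interior grid points sorted by decreasing change in the
    finite-difference slope of z = f(d)."""
    slopes = [(z2 - z1) / (d2 - d1)
              for (d1, z1), (d2, z2) in zip(zip(ds, zs),
                                            zip(ds[1:], zs[1:]))]
    jumps = [abs(s2 - s1) for s1, s2 in zip(slopes, slopes[1:])]
    order = sorted(range(len(jumps)), key=jumps.__getitem__, reverse=True)
    return [ds[i + 1] for i in order]
```

On a curve that rises steeply near d = 0 and then flattens, the first candidate returned is the grid point where the steep segment ends, which is where the text's finer grid would be placed.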
Even when in Figure 2 of the main text the function z = f(d) resembles a continuous function, it is in fact a non-decreasing step function of the continuous threshold d. It is a step function because the value of z, the number of clusters, is a discrete quantity. It is non-decreasing because if two contigs are clustered at d1 + ε (with ε > 0 and d1 + ε ≤ 1), their weight is at least d1 + ε > d1, so they were also clustered at d1; hence the clustering at d1 + ε is a refinement of the clustering at d1, and the total number of clusters satisfies f(d1 + ε) ≥ f(d1) = z1, i.e., z must stay the same or increase, but never decrease.
The exact form of the function z = f(d) is 'transcriptome dependent', meaning that it is the result of various factors, including genome structure, the 'treatments' under which the transcriptome was obtained (tissues, conditions, etc.), the gene expression profile under each 'treatment', as well as the length and number of reads employed to obtain the transcriptome. Let's assume that the sequencing depth of each one of the libraries is not limiting, i.e., that even the genes with lower expression are represented in the transcriptome by at least one contig. Then, the size of the reads employed will be the most important factor to confound two or more contigs that are generated from two or more different loci. For example, if the transcriptome is fully sequenced with very long reads, such as the ones generated by the PacBio technology (Rhoads and Au, 2015), it could happen that all different transcripts are shorter than the average read length; then there will be no 'shared' reads between contigs and thus no need to apply any algorithm for contig clustering. In fact, in that case the function z = f(d) will be just a constant z*, independent of the value of d. With reads smaller than the average size of the transcripts it becomes possible that some reads will be shared between two or more of the assembled contigs; the corresponding contigs will be linked by reads representing motifs shared between alleles, splicing variants, post-transcriptional modifications, etc. With all other conditions constant, smaller reads will cause a larger number of contigs in the output of the assembler, and in that case algorithms such as Compacta will be useful to select a representative set of contigs (clusters) for downstream analyses.
The relevant question is how to select the value of d, and of course, it depends to some extent on the aim of the research project. If the researcher wants to perform a very general analysis, for example to identify the main genes or groups of genes represented in the transcriptome, then a value of d = 0 will give the maximum transcriptome 'compression' by selecting only one contig from each group. On the other hand, if a very detailed analysis is desired, a value of d = 1 will filter out only contigs with low coverage, selecting all different transcripts except those that are subsets of each other.
A third and more frequent case is that the researcher wants a set of contigs that represents the most relevant groups of transcripts with a relatively low level of redundancy. In such a case the researcher needs some guidance about the value of d that will give such a set. An advantage of using the Arabidopsis transcriptome is that its genome is very well annotated; of course, that benefit will be absent in all cases where the transcriptome comes from an unknown (not sequenced) genome. However, investigating the number of loci expressed in the example transcriptome will suggest general guidelines for the selection of a suitable value of d in the general case.
3.0.1. The set to compare the Arabidopsis transcriptome. To compare the assembly as well as Compacta results, we downloaded from the TAIR10 blastsets the file TAIR10_cdna_20101214_updated, which contains all known cDNAs from Arabidopsis (defined as CDS+UTRs-introns). The Arabidopsis cDNA file contains 41,671 sequences, corresponding to 33,602 loci; thus, on average, each locus is represented by 41671/33602 ≈ 1.24 sequences. These cDNA sequences were formatted as a BLAST DB named here 'TAIR10cdna'. Note that in the guidelines for Arabidopsis nomenclature, each locus is identified by a unique chain of symbols, for example 'AT1G51370.2', meaning 'AT' = Arabidopsis thaliana, '1' = chromosome 1, 'G51370' = gene 51370 and, finally, '.2' represents the second sequence from that locus (G51370).
The 106,895 contigs in the Arabidopsis assembled transcriptome were used as queries against the BLAST database TAIR10cdna of all 41,671 Arabidopsis cDNA sequences. The BLAST experiment was performed with a local installation of the NCBI BLAST implementation [Package: blast 2.7.1, build Oct 18 2017] (Zhang et al., 2000). The command line for the BLAST experiment was

blastn -query allContigs.fasta -db TAIR10cdna -out allContigs_out1.txt -outfmt 6 -dust no -max_target_seqs 1 -ungapped

This command line implies the use of the task 'megablast' in tabular format, with no 'dust' filter, giving only one target for each of the contigs queried. Only continuous (non-gapped) hits were taken into account. The resulting file 'allContigs_out1.txt' was loaded into the table 'allContigs_out1' of the MySQL DB 'razo'.
The results in 'allContigs_out1' included 194,641 rows for 94,286 contigs with hits to 27,016 distinct Arabidopsis sequences. A total of 106895 − 94286 = 12609 contigs (≈ 12% of the total) did not give any hit, showing that such contigs could be either assembly artifacts or transcripts from post-transcriptional modifications which are not currently annotated in Arabidopsis.
The next step in the analysis was to select only the hit with the maximum bit score, i.e., the most 'significant' one for each contig / Arabidopsis sequence pair. These results are in table 'contigs2locus', and consist of 94,286 pairs (contig / Arabidopsis sequence concordances) for 27,016 different Arabidopsis sequences. This implies that we have on average 94286/27016 ≈ 3.49 contigs for each of the Arabidopsis sequences.
The next problem was to determine which of these 94,286 pairs (contig / Arabidopsis sequence concordances) were relevant or 'significant', in the sense that a given contig appropriately represents an Arabidopsis locus. To consider a BLAST result as relevant we required that the length of the alignment between the contig and the Arabidopsis sequence cover a minimum of 75% of the contig OR a minimum of 75% of the Arabidopsis sequence, AND present a minimum bit score of 90. This last requirement implies that the maximum expected value (BLAST 'E-value') was 1.47 × 10^-16, i.e., extremely significant from the statistical point of view. From this BLAST experiment we can conclude that, in principle, a very small number of contigs, ≈ 23,607, could be sufficient to represent the same number of different Arabidopsis sequences expressed within the transcriptome. That target number of contigs is shown as a violet arrow in Figure 1, and it implies that a very small value of d, say d ≈ 0.01, could give such a small number of representative contigs. However, in principle nothing guarantees that the z contigs selected by Compacta will reasonably represent all 23,607 Arabidopsis sequences. To study that problem it is necessary to see how many Arabidopsis sequences or loci are represented by the z contigs selected with a given value of d. Table 5 presents the selected number of contigs (n_con), the number of Arabidopsis sequences represented (n_As) by having a bit score ≥ 90 (E-value ≤ 1.47 × 10^-16), the percentage of the total of Arabidopsis sequences represented (%As) and the percentage of efficiency (Ef = 100 × n_As/n_con) for different values of d. Also, Figure 2 in the main text presents the plot of %As and %Ef as functions of d.
From Table 5 we can see that both n_con and n_As are increasing functions of d; larger values of d imply larger numbers of representative contigs selected and, as a consequence, a larger number of clearly identified Arabidopsis sequences. Each contig employed is the largest one of the corresponding Compacta cluster. It is also important to note that, for all values of d, n_con > n_As; i.e., the number of contigs employed is always larger than the number of Arabidopsis sequences identified by them, which in turn means that there is some identification redundancy and not a one-to-one relation between contigs and Arabidopsis sequences.

Table 5. Selected number of contigs (n_con), number of Arabidopsis sequences represented (n_As), percentage of the total of Arabidopsis sequences represented (%As) and percentage of efficiency (Ef = 100 × n_As/n_con) for different values of d.
Also from Table 5, but more clearly in Figure 2 in the main text, we can see that %As is a non-decreasing function of d, like z = f(d) in Figure 2 in the main text. Also like z = f(d), %As presents a very steep slope for 0 ≤ d ≤ 0.05, going from 14.17% at d = 0 to 77.86% at d = 0.05 (∆%As/∆d = 1273.8). Thus, at d = 0.05 we have 77.86% of all Arabidopsis sequences correctly identified. At d = 0.05 we also have the maximum value of Ef = 92.08; at that point the number of identified sequences per contig peaks, decreasing at a relatively slow pace until d = 0.75, where it goes down fast to its minimum value of only 26.09 at d = 1.
From Figure 2 in the main text it is obvious that there is no value of d which maximizes both %As and Ef simultaneously. The selection of any d value means a compromise between the number of contigs to be analyzed and the proportion of Arabidopsis sequences correctly identified in the transcriptome.
However, given that Compacta is fast and efficient, rounds of runs can be performed until a suitable value of d, and consequently a set of z contigs to be analyzed, is heuristically determined to optimize the set best suited for the research objectives. Compared with the cost of data gathering (RNA-Seq library construction and sequencing), the cost of analysis is insignificant.
All results presented in this subsection are available from the authors as a relational database.

Comparing differential expression
Contig clustering may affect gene expression analysis: poor precision caused by over-clustering can lead to problems detecting differentially expressed genes, because genes with different relative expression are clustered together, while poor recall caused by under-clustering increases the size of the reported transcriptome, making it difficult to execute downstream analyses (Davidson and Oshlack, 2014). To test the effect of Compacta's clustering on gene expression analysis, we used the real datasets from Arabidopsis and mouse previously employed. We used Hisat2 to map the reads back to their corresponding transcriptome, and Compacta to cluster the contigs and produce cluster-contig lists and raw count matrices. Then we used RSEM v1.3.1 on the cluster definitions produced by Compacta to build references with parameters 'rsem-prepare-reference --bowtie --transcript-to-gene-map', and again we used RSEM, this time to calculate expression levels with parameters 'rsem-calculate-expression --estimate-rspd --paired-end', to get cluster-level counts. We also used RSEM on the same datasets, but this time we built the references using the corresponding genome with parameters 'rsem-prepare-reference --bowtie --gtf', then calculated gene-level expression with 'rsem-calculate-expression --estimate-rspd --paired-end --no-qualities' to get a reference based on truth data.
We used the count matrices produced by Compacta, together with the cluster-level and gene-level counts produced by RSEM, in edgeR to carry out differential gene expression analysis using the exact test with FDR = 0.01 and lfc >= 1 to detect significantly differentially expressed genes. For each test we identified clusters against the corresponding genome using BLASTN v2.6.0 with parameters '-max_target_seqs 1 -outfmt 6 -evalue 1e-3', and discarded alignments shorter than 200 bp or below 98% identity. We also ignored clusters without a match in the genome-based differentially expressed gene list, and when more than one cluster matched a gene, only the one with the largest lfc value was kept. We then computed Pearson's and Spearman's correlations between the filtered significantly differentially expressed clusters obtained from RSEM and the differentially expressed genes (genome-based), and between the DE clusters obtained from Compacta counts and the DE genes from the genome-based tests. Results are presented in Figure 6.
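The filtering step described above can be sketched as follows. The parsing assumes standard BLAST tabular output (-outfmt 6: qseqid, sseqid, pident, length, ...); the sample hits, cluster names, and lfc values are illustrative placeholders, not results from the study.

```python
# Sketch of the post-BLAST filtering described above: drop alignments
# shorter than 200 bp or under 98% identity, ignore clusters without an
# lfc entry, and when several clusters match one gene keep the one with
# the largest lfc. Pearson's correlation is then computed on the kept
# log fold changes. All sample data are hypothetical.
from math import sqrt

def filter_hits(blast_lines, cluster_lfc, min_len=200, min_ident=98.0):
    best = {}  # gene -> (cluster, lfc)
    for line in blast_lines:
        cluster, gene, pident, length = line.split("\t")[:4]
        if float(pident) < min_ident or int(length) < min_len:
            continue
        lfc = cluster_lfc.get(cluster)
        if lfc is None:  # cluster not significantly DE
            continue
        if gene not in best or lfc > best[gene][1]:
            best[gene] = (cluster, lfc)
    return best

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hits = [
    "c1\tg1\t99.0\t500",   # kept
    "c2\tg1\t98.5\t450",   # same gene, smaller lfc -> dropped
    "c3\tg2\t97.0\t500",   # identity too low -> dropped
    "c4\tg3\t99.0\t150",   # alignment too short -> dropped
    "c5\tg4\t100.0\t300",  # kept
]
lfc = {"c1": 2.0, "c2": 1.5, "c5": -1.2}
kept = filter_hits(hits, lfc)
print(kept)  # {'g1': ('c1', 2.0), 'g4': ('c5', -1.2)}

# Correlate kept cluster lfcs against hypothetical genome-based gene lfcs.
gene_lfc = {"g1": 1.8, "g4": -1.0}
xs = [kept[g][1] for g in kept]
ys = [gene_lfc[g] for g in kept]
print(round(pearson(xs, ys), 3))
```

Spearman's coefficient follows the same pattern applied to the ranks of xs and ys rather than the raw values.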
In Figure 6, subfigures A and B show the correlation of differentially expressed clusters of Arabidopsis, obtained using RSEM and Compacta respectively, against the differentially expressed genes identified with the reference genome. Both graphs show a correlation greater than 0.96 at α = 0.9, and the correlation values vary only slightly between the two tests. This means that the statistically significant DE clusters obtained by using Compacta's count output in a differential gene expression analysis are very similar to the DE genes selected by a genome-guided DE analysis, and that these results do not differ from those obtained with abundance estimation tools such as RSEM. Subfigures C and D show the correlation of differentially expressed clusters of mouse using RSEM and Compacta as abundance estimation tools, respectively. Both show a correlation greater than 0.9 at α = 0.9 against the differential gene expression analysis guided by the reference genome. These correlation coefficients are smaller than those of the Arabidopsis tests, which can be explained by the complexity of the dataset. However, Compacta shows a slightly higher correlation coefficient (differences of 0.0103 and 0.0377 for Pearson's and Spearman's, respectively). The difference between the RSEM and Compacta correlation coefficients may be caused by the fact that RSEM tends to report fewer counts for a small number of genes than other abundance estimation tools, as previously discussed in Davidson and Oshlack (2014). This leads to missing true DE genes in downstream analyses and, consequently, to a smaller correlation against the truth datasets. Figure 6. Effect of contig clustering on differential gene expression analysis. Significantly differentially expressed clusters were compared to genes tested for differential expression using a genome-based mapping approach.
Correlation of log2 fold changes between significantly differentially expressed clusters and genes tested for DE using a genome-based mapping approach for the Arabidopsis and mouse real datasets. A) and B) show the correlation of log2 fold changes of DE genes of the Arabidopsis real dataset using RSEM and Compacta for transcript quantification, respectively. C) and D) show the correlation of DE genes of the mouse real dataset using RSEM and Compacta for transcript quantification, respectively.