Table 1 Variables and their weights used to filter putative transcripts

From: Rapid transcriptome sequencing of an invasive pest, the brown marmorated stink bug Halyomorpha halys

Attribute

Weight

Variable

Reason

A) Based on the putative transcript sequence

1. What proportion of the database protein is covered in the first Uniref100 hit?

10

Proportion covered

Reward the putative transcript based on the proportion of the database protein that is covered in the best BLASTX hit

2. What proportion of the putative transcript is covered by the first Uniref100 hit?

8

Proportion covered

Reward the putative transcript based on the proportion of the query putative transcript that is covered in the best BLASTX hit

3. What is the length covered on the database protein in the first Uniref100 hit?

7

Database hit length/longest database hit length

Reward based on the absolute database protein length covered in the best BLASTX hit, compared to the longest hit length in the component

4. What is the length covered on the putative transcript in the first Uniref100 hit?

5

Putative transcript hit length/longest putative transcript hit length

Reward based on the absolute query putative transcript length covered in the best BLASTX hit, compared to the longest hit length in the component

5. Is the strand of the Uniref100 match the expected one (based on SSLR)?

4

Match strand × (−SSLR/max |SSLR|)

Reward matches on the plus strand if SSLR < 0 or on the minus strand if SSLR > 0; penalize matches on the plus strand if SSLR > 0 or on the minus strand if SSLR < 0

6. What proportion of the database protein is covered in the first NR hit?

9

Proportion covered

Same as the corresponding metric for Uniref100

7. What proportion of the putative transcript is covered by the best NR hit?

7

Proportion covered

Same as the corresponding metric for Uniref100

8. What is the relative length covered on the database protein in the first NR hit?

6

Database hit length/longest database hit length

Same as the corresponding metric for Uniref100

9. What is the relative length covered on the Trinity putative transcript in the first NR hit?

4

Putative transcript hit length/longest putative transcript hit length

Same as the corresponding metric for Uniref100

10. Is the strand of the NR match the expected one (based on SSLR)?

3

Match strand × (−SSLR/max |SSLR|)

Same as the corresponding metric for Uniref100

11. Is the SSLR negative (i.e. as expected)?

7

−SSLR/max |SSLR|

Reward putative transcripts with the normal, negative SSLR

12. How long is the putative transcript compared to the longest in the component?

7

Putative transcript length/longest putative transcript length

Reward longer putative transcripts

B) Based on the ORFs

1. Is the best match for each ORF the same?

10

(1 − number of best matches)/number of best matches

Penalize putative transcripts whose ORFs have different best hits

2. Are there ORFs in both strands with both having an NR hit?

10

−Number of ORFs in strand "A"/number of ORFs in strand "B"

Maximum penalty if both ORFs have an NR hit

3. Are there ORFs in both strands with only one having an NR hit?

8

−Number of ORFs in strand "A"/number of ORFs in strand "B"

Intermediate penalty if only one of the ORFs has an NR hit

4. Are there ORFs in both strands with neither having an NR hit?

3

−Number of ORFs in strand "A"/number of ORFs in strand "B"

Small penalty if none of the ORFs has an NR hit

5. How many ORFs are called?

8

(1 − number of ORFs)/number of ORFs

Penalize putative transcripts with more than one ORF

6. Are the ORFs found only in the expected strand (SSLR)?

8

ORF strand × (−SSLR/max |SSLR|)

Reward putative transcripts having ORFs called in only the expected strand

C) Sequencing coverage dips

1. How many sequencing coverage dips?

10

−Number of dips/max number of dips in the component

Penalize putative transcripts with sequencing coverage dips
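
To illustrate how the weights and variables in this table might combine, here is a minimal Python sketch. It rests on several assumptions that are not stated in the table: that the per-attribute variables are multiplied by their weights and summed into a single score, that a strand is encoded as +1 (plus) or −1 (minus), and that SSLR and the per-component maxima have already been computed. The dictionary keys (`A1`, `B5`, ...) and the function names are hypothetical labels for the table rows, not identifiers from the article.

```python
# Weights copied from Table 1. The keys are hypothetical row labels
# (section letter + row number), introduced only for this sketch.
WEIGHTS = {
    # A) Based on the putative transcript sequence
    "A1": 10, "A2": 8, "A3": 7, "A4": 5, "A5": 4, "A6": 9,
    "A7": 7, "A8": 6, "A9": 4, "A10": 3, "A11": 7, "A12": 7,
    # B) Based on the ORFs
    "B1": 10, "B2": 10, "B3": 8, "B4": 3, "B5": 8, "B6": 8,
    # C) Sequencing coverage dips
    "C1": 10,
}


def strand_term(strand, sslr, max_abs_sslr):
    """Strand x (-SSLR / max|SSLR|), as in rows A5, A10 and B6.

    Assumes strand is +1 for plus and -1 for minus. The result is
    positive when the strand agrees with the sign of SSLR as described
    in the table (plus with SSLR < 0, minus with SSLR > 0).
    """
    if max_abs_sslr == 0:
        return 0.0  # assumption: no strand signal, so no reward or penalty
    return strand * (-sslr / max_abs_sslr)


def count_penalty(n):
    """(1 - n) / n, as in rows B1 and B5: 0 for n = 1, negative for n > 1."""
    if n <= 0:
        return 0.0  # assumption: nothing to penalize when nothing is counted
    return (1 - n) / n


def weighted_score(values, weights=WEIGHTS):
    """Sum of weight x variable over the supplied rows (assumed aggregation)."""
    return sum(weights[row] * value for row, value in values.items())


# Hypothetical putative transcript, scored on four rows only.
example = {
    "A1": 0.5,                          # half of the database protein covered
    "A5": strand_term(+1, -2.0, 4.0),   # plus-strand hit, negative SSLR
    "B5": count_penalty(2),             # two ORFs called
    "C1": -1 / 2,                       # one dip, at most two in the component
}
print(weighted_score(example))
```

Under these assumptions, the rewards (rows with positive variables) and penalties (rows with negative variables) share one scale, so putative transcripts within a component can be ranked by a single number.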