This study examines gene expression, molecular evolution, and gene structure information of singleton genes in the diploid species *S. tropicalis* that are orthologous either to a pair of ohnologs in the tetraploid species *X. laevis* or to a single-copy gene in *X. laevis*. We assume that all single-copy genes in *X. laevis* were once part of a pair of ohnologs but that one copy has been lost due to post-WGD pseudogenization. Our analyses therefore consist of two gene sets: (1) gene triads, which include a pair of *X. laevis* ohnologs and the corresponding *S. tropicalis* singleton ortholog, and (2) gene dyads, which include one *X. laevis* singleton and the corresponding *S. tropicalis* singleton ortholog.

Nucleotide sequences from 3,387 gene triads and 4,746 dyads were gathered from the NCBI UniGene databases using tBLASTx and a reciprocal best hit approach. UniGenes are a set of non-redundant clusters of transcript sequences that compose the expressed sequence tag (EST) libraries. In a gene triad, the two *X. laevis* UniGenes were reciprocal best hits within the *X. laevis* UniGenes, and they both returned the same *S. tropicalis* UniGene as the top hit. In a dyad, each putative *X. laevis* singleton has a single unique reciprocal top hit with *S. tropicalis*. UniGenes that had more than two *X. laevis* genes or more than one *S. tropicalis* gene with reciprocal best hits were excluded. To test whether our putative singletons were indeed singletons, we tried to amplify the other ohnolog of 17 *X. laevis* singletons using PCR primers designed to amplify both the *X. laevis* singleton and the *S. tropicalis* ortholog [41]; in all cases only one gene copy was amplified in *X. laevis*. Triads and dyads were aligned using MUSCLE [42], and Perl scripts were used to predict the beginning and end of coding regions by looking for the longest open reading frame in either direction. Overlapping alignments shorter than 201 nucleotides were discarded along with 5' and 3' untranslated regions and indels.
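The reciprocal best hit logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name and the three best-hit lookup tables are hypothetical, the tables are assumed to have been parsed from tBLASTx output beforehand, and the additional exclusion filter for UniGenes with extra reciprocal best hits is omitted.

```python
def classify_triads_and_dyads(best_within_laevis, laevis_to_trop, trop_to_laevis):
    """Group genes into triads and dyads from precomputed best-hit tables.

    All three arguments are hypothetical inputs (dicts of gene identifiers):
      best_within_laevis: X. laevis gene -> its best hit among other X. laevis genes
      laevis_to_trop:     X. laevis gene -> its best S. tropicalis hit
      trop_to_laevis:     S. tropicalis gene -> its best X. laevis hit
    """
    triads = []
    in_triad = set()
    for a, b in best_within_laevis.items():
        # Visit each candidate pair once (a < b) and require reciprocity.
        if a < b and best_within_laevis.get(b) == a:
            trop = laevis_to_trop.get(a)
            # Both ohnologs must return the same S. tropicalis top hit.
            if trop is not None and trop == laevis_to_trop.get(b):
                triads.append((a, b, trop))
                in_triad.update((a, b))

    triad_trop = {trop for _, _, trop in triads}
    dyads = []
    for x, trop in laevis_to_trop.items():
        if x in in_triad or trop in triad_trop:
            continue
        # A dyad requires a reciprocal top hit between the two species.
        if trop_to_laevis.get(trop) == x:
            dyads.append((x, trop))
    return triads, dyads


triads, dyads = classify_triads_and_dyads(
    {"xl1": "xl2", "xl2": "xl1", "xl3": "xl1"},
    {"xl1": "st1", "xl2": "st1", "xl3": "st2"},
    {"st1": "xl1", "st2": "xl3"},
)
```

In this toy input, `xl1` and `xl2` are reciprocal best hits that share the top hit `st1`, so they form a triad, whereas `xl3` and `st2` are reciprocal best hits across species and form a dyad.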

Non-normalized EST libraries were obtained from NCBI (dbEST Library IDs: 10829, 10830, 10895, 10896, 20954, 20886, 21298, 20560, 20561, 20562, 20892, 20891, 20911, 20931, 20912, 20947, 20953, 20901, 16856, 16857, 16858, 16871, 16854, 16853, 16863, 16862, 16870, 16864, 16872, 17807, 16868, 16869, 16867, 16865, 16866, 17804, 17805, 17806, 16855, 16873, 16876, 16875, 16877, 16878, 16874, 16801, 16859, 16860, 16861, 16880, 8773, 8701, 20682, 9909, 9665, 9908, 14603). We classified and combined 718,484 *S. tropicalis* ESTs (from a total of 1,271,375) into 14 distinct adult tissues and 4 embryological stages: brain, bone, eye, heart, kidney, liver, lung, thymus, pancreas, skin, spleen, fat body, ovary, testis, egg, gastrula stage, neurula stage, and embryo stage 62. Each library consists of at least 3,900 transcripts (on average 39,915) and each individual UniGene has its own set of unique ESTs. We used the proportion of transcripts of each gene divided by the number of transcripts in a given EST library as an estimate of its expression level to control for the different sizes of the EST libraries. For twenty-three outlier genes with high expression, we truncated the expression level to a value of 1% of the respective EST library to prevent these outliers from dominating the results. Of the set of genes for which sequence data were available, a total of 3,298 triads and 4,426 dyads also had expression data (at least one *S. tropicalis* EST read in at least one EST database).
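The normalization and truncation described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that applying the 1% ceiling to every gene is equivalent to truncating only the outlier genes, which holds only if no other gene exceeds 1% of a library.

```python
def expression_level(gene_reads, library_size, cap=0.01):
    """Proportion of a library's transcripts that belong to one gene.

    gene_reads:   number of EST reads of the gene in the library
    library_size: total number of transcripts in the library
    cap:          ceiling on the proportion (0.01 = 1% of the library)
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return min(gene_reads / library_size, cap)


moderate = expression_level(5, 1000)   # 0.5% of the library, left unchanged
outlier = expression_level(50, 1000)   # 5% of the library, truncated to 1%
```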

### Expression summary statistics

To characterize expression patterns in the diploid species *S. tropicalis*, three non-independent summary statistics were calculated: total expression (*T*), expression intensity (*I*), and expression evenness (*E*). Total expression *T* is simply the expression level (the proportion of total EST reads of a gene in a given EST library, $L_i$) summed across all EST libraries ($T = \sum_i L_i$). Genes that are evenly expressed at moderate levels in many tissues have similar *T* to genes that are highly expressed in only a few tissues. We therefore introduced a measure of "intensity". Intensity is the mean expression level as seen by a gene, rather than by a tissue. Thus, we expect a gene that is highly expressed in only a few tissues to have a moderate total expression, but a high level of intensity. We calculate intensity as a weighted average of expression levels, where the expression levels themselves are the weights: $I = \sum_i L_i^2 / \sum_i L_i$.
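The contrast between total expression and intensity can be made concrete with a small Python sketch. The function names and the two example genes are illustrative assumptions, not data from this study.

```python
def total_expression(levels):
    """T: expression levels summed across all EST libraries."""
    return sum(levels)


def intensity(levels):
    """I: average expression level, weighted by the levels themselves."""
    total = sum(levels)
    if total == 0:
        return 0.0  # no reads in any library: I is set to zero
    return sum(x * x for x in levels) / total


# Two hypothetical genes measured in ten libraries.
broad = [0.001] * 10                  # moderate expression everywhere
narrow = [0.005, 0.005] + [0.0] * 8   # high expression in two libraries

t_broad, i_broad = total_expression(broad), intensity(broad)
t_narrow, i_narrow = total_expression(narrow), intensity(narrow)
```

Both genes have the same total expression (0.01), but the intensity of the narrowly expressed gene (0.005) is five times that of the broadly expressed gene (0.001).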

Although *T* and *I* capture the desired information about gene expression levels and distribution, we added a measure of evenness to give our linear model more flexibility. We define the evenness *E* as *T*/*I*; this is a logical measure of the "effective" number of tissues in which a gene is expressed, and we consider this analogous to how broadly a gene is expressed. *E* is equal to $(\sum_i L_i)^2 / \sum_i L_i^2$, and is therefore Simpson's diversity (equivalent to 1/Simpson's index). *E* therefore measures how evenly distributed gene expression is across different tissues. A gene with relatively even distribution across tissues will have an elevated *E* (will be broadly expressed), irrespective of the total expression level. *T* was calculated for 8,133 genes. For 409 genes that had no EST reads, *T* was zero, *E* was undefined and set as missing, and *I* was undefined but set to zero since it must approach zero as expression levels approach zero.
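A minimal Python sketch of the evenness calculation follows, including the treatment of genes without EST reads. The function name is hypothetical, and `None` is used here as an assumed stand-in for a missing value.

```python
def evenness(levels):
    """E = T / I = (sum of levels)^2 / (sum of squared levels).

    Returns None (missing) when the gene has no reads in any library,
    because the ratio is undefined in that case.
    """
    sum_sq = sum(x * x for x in levels)
    if sum_sq == 0:
        return None
    return sum(levels) ** 2 / sum_sq


e_even = evenness([0.002, 0.002, 0.002, 0.002])  # equal in four libraries
e_single = evenness([0.008, 0.0, 0.0, 0.0])      # confined to one library
e_none = evenness([0.0, 0.0, 0.0, 0.0])          # no reads: missing
```

A gene expressed equally in four libraries has an effective number of tissues of 4, and a gene confined to a single library has an effective number of 1, even though both have the same total expression.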

### Molecular evolution

We used 454 pyrosequencing of cDNA from *Pipa carvalhoi* and *Hymenochirus curtipes* to generate outgroup sequences for analysis of molecular evolution in *Silurana* and *Xenopus* (following the same protocol as in [43]). Sequences were assembled using gsAssembler and gsMapper (454/Roche) and the resulting contigs aligned to the triads and dyads using Perl scripts, BLAST and MUSCLE. This effort recovered sequences from portions of 2,157 genes from *P. carvalhoi* or *H. curtipes* that were used to root a phylogeny with sequences from *S. tropicalis* and either one ortholog or both ohnologs of *X. laevis*. New sequences from the outgroup species have been deposited in the Transcriptome Shotgun Assembly database (JP285961 - JP288098, JP297711 - JP297788).

We calculated rates of nonsynonymous (dN) and synonymous (dS) substitutions per site, and the dN/dS rate ratio of the *S. tropicalis* lineage. These statistics were calculated with the codeml program in the PAML package version 3.15 [44] using a maximum likelihood model that individually estimates each type of substitution rate for each branch. Sixty-five values of dN/dS were undefined due to a dS of zero. To cope with this, we instead used an adjusted dN/dS in our analysis, defined as dN/(dS+0.02). We made this adjustment a priori (looking only at dS values) and performed it on all genes for which we had sequence data. By choosing adjusted dN/dS as our measure of the relative strength of selection in our model, we are simply choosing a different proxy for this underlying effect so that we are able to make use of more of our data.
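The adjustment is a one-line calculation, sketched here in Python. The function name is hypothetical, and the dN and dS values are assumed to come from codeml output.

```python
def adjusted_dn_ds(dn, ds, offset=0.02):
    """Adjusted rate ratio dN / (dS + 0.02); defined even when dS is zero."""
    return dn / (ds + offset)


ratio_zero_ds = adjusted_dn_ds(0.01, 0.0)   # unadjusted dN/dS would be undefined
ratio_typical = adjusted_dn_ds(0.02, 0.18)
```

Because the offset is added to every gene, the adjusted ratio is always somewhat lower than the unadjusted dN/dS wherever the latter is defined and nonzero, with the largest relative change for genes with small dS.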

### Gene structure

Using JGI annotations for *S. tropicalis* EST data, we collected statistics on gene structure for 6,075 genes. These statistics relate to content and packaging of information; for each gene we scored the number of exons, the length of the protein-coding region, the total length of introns and the amino acid diversity (Shannon Index).
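As an illustration of the amino acid diversity statistic, the Shannon index of a protein sequence can be computed from its residue frequencies. The sketch below assumes natural logarithms and one-letter residue codes; the logarithm base and the function name are assumptions rather than details given above.

```python
import math
from collections import Counter


def shannon_index(protein):
    """Shannon index H = -sum(p * ln p) over amino acid frequencies p."""
    counts = Counter(protein)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("protein sequence is empty")
    return -sum((c / n) * math.log(c / n) for c in counts.values())


h_uniform = shannon_index("AAAA")  # a single residue type: no diversity
h_two = shannon_index("ACAC")      # two equally frequent residue types
```

The index is zero for a sequence made of one residue type and reaches its maximum, ln 20, when all twenty amino acids are equally frequent.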

### Logistic regression, missing data and confidence intervals

Logistic regression is a generalized linear model used for binomial regression, providing a useful statistical framework with which to explore the impact of continuous and potentially non-independent variables on a binary outcome (here, whether or not both copies of a duplicate gene persist after WGD). Logistic regression was recently employed to identify predictors of paralog retention originating from both WGD and tandem duplications in *Populus* [45]. Whereas that study used genetic variables taken from the tetraploid species, we used the logistic regression to test the association between genetic variables measured in a diploid and the persistence of orthologous ohnologs in a tetraploid. Our goal was not to make a predictive model, but to use the linear model as a tool to study associations and make inferences about evolutionary mechanisms. This analysis was performed using R [46].

We divided each of our variables by its standard deviation; thus the regression coefficients are an approximation of the relative influence that each variable has on the outcome - the odds of persisting as a duplicate gene. Some of our variables have missing data. To make possible a single analysis that jointly considered all of the data, we substituted each missing value with the grand mean of that variable. Variables were not re-normalized after substitution because this would affect the variance and inferred importance of the variable in the logistic regression. Because this replacement affects the standard regression assumptions, we report P-values based on a permutation test of the model. For each variable, we fitted the model 2000 times - once with the original data, and 1999 times with the focal variable replaced by a random permutation of itself. The two-tailed permutation P-value is twice the proportion of these fits (including the original) whose coefficient for the focal variable is at least as extreme as the coefficient from the original fit, in the same direction: greater than or equal to it when the original coefficient is positive, and less than or equal to it when the original coefficient is negative. Since we conservatively count the original fit, our two-tailed P-value from 2000 fits cannot be less than 0.001. The permutation P-values are similar to the standard P-values from the logistic regression (not shown) and in fact give exactly the same pattern of significance. An analysis including the 2,021 genes without missing data was also performed and does not change our overall conclusions: one variable is no longer significant (number of exons) and another is now significant (amino acid diversity). The direction of the correlation of our significant variables does not change, and the relative strength of the strongest predictors (total expression, evenness of expression, dN and dS) remains the same.
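The scaling, imputation and permutation P-value steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the R code used for the analysis: the function names are hypothetical, missing values are represented by `None`, the sample standard deviation is assumed, and the model fitting that produces the coefficients is outside the sketch.

```python
import statistics


def scale_and_impute(values):
    """Divide a variable by its standard deviation, then fill missing values.

    The standard deviation and the mean are taken over observed values only,
    and the variable is not re-normalized after substitution.
    """
    observed = [v for v in values if v is not None]
    sd = statistics.stdev(observed)  # sample SD (an assumption)
    scaled = [None if v is None else v / sd for v in values]
    grand_mean = statistics.fmean([v for v in scaled if v is not None])
    return [grand_mean if v is None else v for v in scaled]


def permutation_p_value(original_coef, permuted_coefs):
    """Two-tailed permutation P-value for one focal variable.

    permuted_coefs holds the focal coefficient from each fit in which the
    focal variable was randomly permuted. The original fit is counted among
    the fits that are at least as extreme as itself.
    """
    if original_coef >= 0:
        extreme = sum(1 for c in permuted_coefs if c >= original_coef)
    else:
        extreme = sum(1 for c in permuted_coefs if c <= original_coef)
    n_fits = len(permuted_coefs) + 1  # permuted fits plus the original
    return min(1.0, 2 * (extreme + 1) / n_fits)


imputed = scale_and_impute([2.0, 4.0, None])
p_min = permutation_p_value(2.0, [0.0] * 1999)
```

With 1999 permuted fits, none of which matches or exceeds the original coefficient, the sketch returns 2 × 1/2000 = 0.001, the minimum attainable value noted above.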