Data collection
Samples of Fundulus grandis were collected from eight estuaries over a range of time points prior to, and concurrent with, the Deepwater Horizon oil release event. Collection sites, from west to east, include: 1) Port Aransas, TX [27°45'59"N, 97°7'33"W]; 2) Cocodrie, LA [29°15' 13.89" N, 90°39' 45.91" W]; 3) Leeville, LA [29° 12' 43.37N, 90° 09' 08.37" W]; 4) Isle Grande Terre, LA [29°16'22.93"N, 89°56'41.87"W]; 5) Dauphin Island, AL [30°20' 04.92"N, 88° 07'57.21"W]; 6) Weeks Bay, AL [30°22'41.80"N, 87°50'19.60"W]; 7) Santa Rosa Island, FL [30°21'16.63"N, 87° 2' 46.92"W]; and 8) Florida State University Marine Station, FL [29°54'56.83"N, 84°30'39.07"W]. Three adults were sampled from Isle Grande Terre (Barataria Bay, LA) on June 28, 2010 -- a time when oil had already heavily infiltrated the area for at least two weeks [16] (exposed samples E1, E2, E3). Each of the three individuals discussed thus far were kept as separate biological replicate samples for sequencing. Two additional samples were composed of 6 and 8 adults pooled from 7 sites East or West of the Mississippi (locations1-3 and 5-8) collected prior to the oil spill (April 2008) (unexposed samples U1 and U2 respectively) (Table 1). Both genders are represented in both exposed and unexposed samples, and differences in size are not expected to be a major factor in the current analysis [24–26]. In all cases the RNA was purified from isolated livers. Several of these sites have a recorded history of hypoxic events as indicated in Additional file 1: Table S2. None of the collection sites are in heavily industrialized areas with known histories of pollution. While small oil seeps occur throughout the Gulf of Mexico, we do not expect any to be exposed at the same levels as the south Louisiana fish during June 2010.
RNA isolation
Oil exposed samples
RNA from adult male GT fish was isolated as reported in [16]. Briefly, liver tissues were dissected from fish field sites immediately following capture and preserved in RNAlater (Ambion, Austin, TX, USA). Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA), purified using RNeasy spin columns (Qiagen Inc., Valencia, CA, USA), and RNA quality was verified using microfluidic electrophoresis (Experion instrument and reagents, Bio-Rad Laboratories, Hercules, CA, USA).
Control samples
RNA was isolated from livers stored in RNALater (20°C) by homogenizing tissue using ceramic beads homogenizer (Bio101, Savant; Carlsbad, CA). Tissues were place in screw-top 2ml tube with 750 ul of chaotropic reagent, 1.25 g ceramic beads and violently shaken for 40 s. Solution was centrifuged and 500 ul of supernatant placed in RNAse-free microfuge tube, with 50 ul 2M NaOAc pH 4.0, vortexed; 500ul of acidic phenol was added and vortexed again; then 100 ul of Chloroform:isoamyl alcohol (24:1) was added and vortexed. This solution was centrifuged at 4°C for 20 min. The upper-phase was placed in RNAse-free microfuge tube and RNA was precipitated with equal volume of isopropanol. RNA pellet was resuspended in chatropic reagent and purified using Qiagen RNA columns following manufacture’s procedures.
The quality of all mRNA samples were verified by electrophoresis on BioAnalyzer (Agilent, Santa Clara CA). The RIN scores for all samples were in the range of 8.3-9.7 (Additional file 1: Table S4).
Short read processing
All RNA samples were sequenced in separate lanes on an Illumina Genome Analyzer (Expression Analysis, Inc. Durahm, NC). The resulting raw data files contained between 60-75 million 60 bp read pairs each for a total of 501,847,804 overall reads (deposited to NCBI short read archive, accession: SRA053469). We filtered and trimmed the reads based on quality scores using an in-house filtration algorithm that removed low scoring sections of each read and preserved the longest remaining fragment based on the following method. First, any reads with uncalled bases were rejected. The phred quality score of 2 encoded in fastq format as a ‘B’ is a special flag indicating that the results downstream of that position are untrustworthy. Therefore, as a second step, portions downstream of ‘B’ quality scores were removed. Finally, reads were broken apart anywhere the quality score value was 10 or below or where the average score of a position and its two neighbors was 20 or below. The largest remaining fragment of each read was kept (provided it was sufficiently long i.e., 49 bp or more) and the rest were discarded. Finally reads that lost their mate pair were moved into a single-end file and the integrity of the remaining read-pairing information was maintained. Any of the Perl scripts we developed to perform these steps are available on request.
Reference transcriptome sequence assembly
We used the Velvet short read assembly package to perform the transcript assembly and made use of the Columbus and Oases extensions to Velvet [27, 28] (Velvet version 1.1.04 and Oases version 0.1.21). As a first step we aligned the filtered and trimmed short reads to F. heteroclitus (a sister species to F. grandis) ESTs (retrieved from GenBank Feb 17, 2011) using the Bowtie short read aligner [29] (Bowtie version 0.12.7). The resulting alignment file (in SAM format), consisting of the read sequences and aligned locations if available, was then passed to Velvet along with the reference ESTs. We scanned a range of k-mer sizes from 21 to 49, and for each size we produced transcripts using Oases. Based primarily on the N50 value profile we selected the k = 27 assembly for further use in these studies (Figure S1). This full set of assembled sequences have been deposited to the Transcriptome Shotgun Assembly (TSA) database at GenBank (accession numbers JW540070 - JW662113) and are also available at xiphophorus.org under ‘Publication Supplements’.
We used Bowtie to map the full set of filtered reads to 203,252 assembled transcriptome sequences, then determined the number of reads mapped to each transcript. The settings we used allowed any read to map with up to 2 mismatches, and reads that could match in more than one location were randomly assigned. Any transcript with fewer than 10 mapped reads was rejected. This process eliminated 41% of the full transcript set resulting in a set of 120,725 transcripts that were used for further analyses.
Analysis of differential expression
We again employed the Bowtie short read alignment program to map each sample of short reads independently to the refined set of reference transcripts. A custom Perl script then determined the number of reads mapped to each contig from the alignment file. A second Perl script then compiled the number of reads per contig per sample into a tabular format. The first column of the data file contains the transcript identifier, and each subsequent column has the number of fragments mapped to that transcript in each sample. We then used the DESeq package (ver 1.4.1, Bioconductor ver. 2.8, R ver. 2.13) to determine differential expression from the compiled tabular data [30]. DESeq uses a model based on the negative binomial distribution to determine significance and was developed specifically to meet the challenges of working with RNA-Seq data. The biological replicates were advantageous because DESeq is then able to determine a much better variance estimate for each transcript improving the statistical power of the experiment. The diagnostic plots produced by DESeq indicated no major problems with the data (Additional file 1: Figures S2, S3, and S4). A significance level of P < 0.01 was used to select differentially expressed sequences. Although there was a newer version of DESeq at the time, the one we employed was stable and used methods which had been published in the paper cited whereas the latest version was being updated daily in the Bioconductor repository.
Sequence annotation
Matches for the reduced set of transcripts are sought from the non-redundant ‘nr’ blast database maintained by NCBI. We obtained the fasta format of the nr database on May 12, 2011, and prepared it for use with MPI-BLAST [31]. We used MPI-BLAST (ver 1.4.0) to query the F. grandis transcripts against nr and kept only 10 hits per sequence with an expect value threshold of 1e-10 using blastx. The results were processed by Blast2GO [32–34] in order to quickly assign a likely identity to each sequence.
Comparison to microarray data
Microarray target sequences were queried against the assembled transcript sequences using blastn with a minimum expect-value threshold of 1e-10. For each microarray target, the hit with the highest bit score was selected as a match. RNA-Seq analyses were then compared to microarray data for F. grandis that compared samples from the exposed collection site (Isle Grande Terre, LA) both before and after the arrival of contaminating oil at that site, and compared the exposed site response to five different unexposed populations [16]. The microarray data were analyzed by mixed-model ANOVA which compared the response pre-oil to post-oil (3 timepoints) and between exposed and unexposed populations (6 populations) using JMP-SAS.