Building ortholog annotations
Human-chimpanzee orthologs and human-macaque orthologs were generated separately, based on human Ensembl annotation (v64) [4], human genome (hg19) [17], chimpanzee genome (panTro2) [18] and macaque genome (rheMac2) [20]. The pair-wise alignment files were downloaded from UCSC genome browser (http://genome.ucsc.edu/). Gene annotation of chimpanzee and macaque used for comparison were also obtained from Ensembl (v64) database (http://www.ensembl.org).
WGA annotation
To keep syntenic information, human exons from all transcripts were lifted to genomic locations on reference genome of chimpanzee and macaque by liftOver tool [14], using pair-wise alignment files downloaded from UCSC genome browser. The liftOver parameter “-minMatch” was set to 0.98 for chimpanzee and 0.913 for macaque, based on bootstrapping (Supplementary Methods & Additional file 2: Figure S2). The lifted exons on reference genome of chimpanzee and macaque were then mapped back to human reference genome, using liftOver tool. During the reciprocal mapping, the following exons/transcripts were excluded: (i) exons cannot be lifted from human to the other species were filtered out; (ii) exons cannot be lifted back to the original genomic location of human genome were filtered out; (iii) transcripts with exons mapped to different chromosomes or strands were filtered out. The process can be completed in one step by running AnnoConvert in our pipeline.
POED annotation
The orthologous exons of human, chimpanzee and macaque were downloaded from Primate Orthologous Exon Database (POED, Version 2; http://giladlab.uchicago.edu/orthoExon/). To be consistent with other databases, we converted genomic coordinates on chimpanzee genome panTro3 to panTro2 by liftOver.
XSAnno annotation
Step1: The first step is the same as how we build WGA annotation.
Step2: Exons from WGA annotation were aligned to the reference genomes of both the same and the other species by BLAT [15]. Percent identity (PID) and percentage of aligned length (PL) were calculated as measures of local alignment. The thresholds of inter-species and intra-species PID and PL were chosen separately to maximize the number of exons retained (Supplementary Methods and Additional file 2: Figure S3). The inter-species PID and PL were selected to filter out exons without unique, highly conserved orthologs. For human and chimpanzee, the inter-species PID and PL were both set to 0.95. For human and macaque, the inter-species PID and PL were both set to 0.9. Exons that were not aligned to the same genomic location as WGA annotation or were aligned to multiple genomic locations using current cutoff were removed. The intra-species PID and PL were selected to filter out exons with highly conserved regions, which may cause ambiguity in mapping. For both chimpanzee and macaque, the intra-species PID was set to 0.97 and the intra-species PL was set to 0.95. Exons that were aligned to multiple genomic locations of their own reference genome at current cutoff were filtered out. The process can be finished by running BlatFilter combined with R [21] functions of threshold determination and filtering.
Step3: To eliminate exons and genes with large inter-species difference in mappability, we generated simulated RNA-seq data with the setting that all transcripts are equally expressed, using simNGS. To run simNGS in parallel with Step Two, we generated simulated HiSeq 100 bp single-end reads based on WGA annotation and then calculated expression only for exons in WGA + LA annotation. Coverage of all transcripts was set to 10X. Ten simulated RNA-seq fastq files were generated for each species. The simulated reads were then mapped to their own genome, using TopHat [19] without providing junction annotation. The number of reads mapped to each exon was counted and used for differential expression analysis with DESeq package [13] for R. Exons and genes that are significantly different between species (FDR < 0.01) were filtered out. Besides, we filtered out genes with length smaller than one third of original length and shorter than 1 kb.
The example scripts to generate simulated reads and to filter exons and genes are available in our pipeline.
Analysis of published data
Affymetrix Human Exon Junction array data were downloaded from GSE15665. Gene expression was estimated using probes perfectly conserved in nonhuman primates and normalized by quantile normalization as described in the original study.
RNA-seq data were downloaded from SRA023554.1. RNAs were sequenced as 36 bp (human) and 35 bp (chimpanzee and macaque) single-end reads by Illumina GAII. Reads were aligned by TopHat, allowing 2 mismathes, without providing transcriptome annotation. Read count and RPKM of genes were calculated by RSEQTools [22].
Ten lanes of simulated RNA-seq data per species were generated by simNGS, using different sets of annotations. DIM genes were identified by DESeq with FDR < 0.01.
To compare inter-species differences of DIM genes with that of genes with consistent cross-species mappability, we performed the F test for equality of variances. In detail, if mappability affects estimation of inter-species differences, we expect larger variance in inter-species differences of DIM genes than in inter-species differences of consistent genes. ( and represent the sample variances of inter-species differences of DIM genes and consistent genes, respectively). The F test was conducted using R function var.test, with alternative hypothesis .
RNA sequencing and data analysis
RNA extraction
Postmortem human brain specimens were obtained from tissue collections at the Department of Neurobiology at Yale University School of Medicine and the Clinical Brain Disorders Branch of the National Institute of Mental Health. Tissue was collected after obtaining parental or next of kin consent and with approval by the institutional review boards at the Yale University School of Medicine, the National Institutes of Health, and at each institution from which tissue specimens were obtained. Tissue was handled in accordance with ethical guidelines and regulations for the research use of human brain tissue set forth by the NIH (http://bioethics.od.nih.gov/humantissue.html) and the WMA Declaration of Helsinki (http://www.wma.net/en/30publications/10policies/b3/index.html). Appropriate informed consent was obtained and all available non-identifying information was recorded for each specimen. Specimens range in age from 21 to 40 years. The postmortem interval (PMI) was defined as hours between time of death and time when tissue samples were frozen.
All experiments using nonhuman primates were carried out in accordance with a protocol approved by Yale University’s Committee on Animal Research and NIH guidelines.
DFC tissue samples were dissected from postmortem adult chimpanzee and macaque brains using the criteria previously described [23, 24]. Human DFC RNA-seq data were generated as a part of the BrainSpan project (http://www.brainspan.org). Together, the RNA-seq dataset includes DFC samples from 5 humans, 5 chimpanzees, and 3 macaques. A bead mill homogenizer (Bullet Blender, Next Advance) was used to lyse the pulverized DFC tissue samples. Each pulverized tissue sample was transferred to a chilled safe-lock microcentrifuge tube (Eppendorf). A mass of chilled stainless steel beads (Next Advance, cat# SSB14B) equal to the mass of the tissue was added to the tube. Two volumes of lysis buffer were added to the tissue and beads. Samples were mixed in the Bullet Blender for 1 min at a speed of six. Samples were visually inspected to confirm desired homogenization and then incubated at 37°C for 5 min. The lysis buffer was added up to 0.6 ml, and samples were mixed in the Bullet Blender for 1 min. Total RNA was extracted using RNeasy Plus Mini Kit (Qiagen) for mRNA-sequencing. Each sample was subjected to a DNase treatment (TURBO DNase, Ambion) as per manufacturers’ instructions.
Optical density values of extracted RNA were measured using NanoDrop (Thermo Scientific) to confirm an A260:A280 ratio above 1.9. RIN was determined for each sample using Bioanalyzer RNA 6000 Nano Kit (Agilent), depending upon the total amount of RNA.
Library preparation for mRNA-sequencing
cDNA libraries were prepared using the mRNA-Seq Sample Kit (Illumina) as per the manufacturer’s instructions with some modifications. Briefly, polyA RNA was purified from 1 to 5 μg of total RNA using (dT) beads. Quaint-IT RiboGreen RNA Assay Kit (Invitrogen) was used to quantitate purified mRNA with the NanoDrop 3300. Following mRNA quantitation, 2.5 μl spike-in master mixes, containing five different types of RNA molecules at varying amount (2.5 × 10−7 to 2.5 × 10−14 mol), were added per 100 ng of mRNA [25]. The spike-in RNAs were synthesized by External RNA Control Consortium (ERCC) consortium by in vitro transcription of de novo DNA sequences or of DNA derived from the B. subtilis or the deep-sea vent microbe M. jannaschii genomes and were a generous gift of Mark Salit at the National Institute of Standards and Technology (NIST). These were used both to track the brain regions, species and to normalize expression levels across experiments. Each sample was tagged by adding a pair of spike-in RNAs unique to the region from which the sample was taken. Also, an additional three common spike-ins were added for controlling sequencing error rates, which is not influenced by SNP existence (Additional file 1: Table S7). Spike-in sequences are available at http://archive.gersteinlab.org/proj/brainseq/spike_in/spike_in.fa. The mixture of mRNA and spike-in RNAs were subjected to fragmentation, reverse transcription, end repair, 3′– end adenylation, and adapter ligation to generate libraries of short cDNA molecules. The libraries were size selected at 200 – 250 bp by gel excision, followed by PCR amplification and column purification. The final product was assessed for its size distribution and concentration using Bioanalyzer DNA 1000 Kit.
Sequencing
We used Illumina’s Genome Analyzer IIx (GAIIx) for mRNA-sequencing by loading one sample per lane. For mRNA-sequencing, the library was diluted to 10 nM in EB buffer and then denatured using the Illumina protocol. The denatured libraries were diluted to 12 pM, followed by cluster generation on a single-end Genome Analyzer IIx (GAIIx) flow cell (v4) using an Illumina cBOT, according to the manufacturer's instructions. The Illumina GAIIx flow cell was run for 75 cycles using a single-read recipe (v4 sequencing kits) according to the manufacturer’s instructions.
Mapping of mRNA-seq reads
We chose TopHat to map RNA-seq reads due to its ability to map junction reads without depending on annotation. The reference genomes used were the same as those for ortholog identification. Only uniquely mapped reads with at most 2 mismatches were included to calculate exon/gene read number and reads per kilobase per million (RPKM) [26].
Testing the effects of filters
To test the effects of each filtering step, we first compared the inter-species variation of genes remained with the ones filtered out in each filtering step. The inter-species log2-fold-change (log 2FC = log 2(RPKMsp1 + 1) − log 2(RPKMsp2 + 1)); sp1 and sp2 stand for Species 1 and Species 2, respectively) were calculated for each gene, using WGA annotation, WGA+LA annotation, and XSAnno annotation, respectively. To test the effects of local alignment, we compared the distribution of inter-species log2FC of genes remained in WGA+LA annotation from WGA annotation with that of genes excluded in WGA+LA annotation. Similarly, the distribution of inter-species log2FC of genes remained in XSAnno annotation was compared with the distribution of genes filtered out in XSAnno annotation from WGA+LA annotation. We conducted the F test for equality of variances as used in analyzing the published dataset.
To compare the inter-species variation of included exons with excluded exons from the same transcripts, we summarized the inter-species variation of in and out exons by calculating the mean inter-species log2FC. In other words, for a specific gene, exonFCin = mean (log2FCin); exonFCout = mean (log2FCout). For each gene, the difference between exons included and excluded was then calculated as In-Out = |exonFCin| - |exonFCout|. We then performed the paired Wilcoxon signed-rank test with alternative hypothesis |exonFCin| < |exonFCout| to test whether inter-species difference of in-exons are smaller than that of out-exons.
Differential expression analysis
Differential expression analysis were performed between human and chimpanzee and between human and macaque, respectively, with DESeq [13] package for R. Genes were identified as DEX, if FDR < 0.01.
The list of human-chimpanzee DEX genes were then intersected with the list of human-macaque DEX genes. Genes with the same direction of change (up or down) in human comparing with other two species were selected as human DEX genes.
Validation by droplet digital PCR
Thirty genes in the human DEX gene list were selected for validation, including 10 most significant human DEX genes only in WGA annotation, 10 most significant human DEX genes in WGA+LA annotation but not in XSAnno annotation, and 10 most significant human DEX genes in XSAnno annotation (Table S4).
We employed droplet digital PCR (ddPCR) to reliably quantify gene expression. An aliquot of the total RNA that was previously extracted from 3 randomly selected brains per species was used for secondary validation through ddPCR analysis. One μg of total RNA was used for cDNA synthesis using SuperScript III First-strand synthesis Supermix (Invitrogen) and subsequently diluted with nuclease-free water. Custom gene-specific primers and probe for each gene of interest were designed using NCBI/Primer-BLAST (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and PrimerQuest tool (IDT). In detail, primer pairs were designed in genomic regions that are orthologous (or identical, if the gene is conserved highly across three species), as well as to be separated by at least one intron on the corresponding genomics DNA with a targeted amplicon size at 70 bp to 200 bp. We also allowed primers to amplify mRNA splice variants that are annotated in RefSeq, while did not allow them to contain known SNPs. The probe was designed by PrimerQuest tool (IDT) by applying the above pre-designed PCR primers. We opted to design identical probe sequence for each species, but if the target region is less conserved across three species, we had to design slightly different probes for each species. IDT’s proprietary ZEN internal quencher was applied on top of a 3′ quencher (IBFQ) and a 5′ fluorophore (FAM or HEX) probe labeling. ddPCR was carried out using the Bio-Rad QX100 system. After each PCR reaction mixture, consisting of ddPCR master mix and custom primers/probe set, was partitioned into 15,000–20,000 droplets, parallel PCR amplification was carried out. Endpoint PCR signals were quantified and Poisson statistics was applied to yield target copy number quantification of the sample. Two color PCR reaction was utilized for the normalization of gene expression by the housekeeping gene TBP. Table S8 in Additional file 1 provides sequences of primers and probes used for the validation.
Gene expression was calculated as the ratio of target genes to the housekeeping gene TBP. Wilcoxon signed-rank tests were performed to identify differentially expressed genes between human and chimpanzee and between human and macaque, separately. Genes were considered as DEX if p ≤ 0.1.