The CQ method
The chromosome quotient is the normalized ratio of female to male alignments to a given reference sequence. For a given sequence Si, CQ
(Si) = F(Si)/M(Si), where F(Si) is the number of alignments from female sequence data to Si, and M(Si) is the number of alignments from male sequence data to Si.
To calculate chromosome quotients we wrote a program we call CQ-calculate (Available from: http://sourceforge.net/projects/cqcalculate/files/CQ-calculate.pl/download). CQ-calculate was written in Perl and designed to rapidly calculate chromosome quotients. There are three inputs to CQ-calculate: reference sequences, male-specific sequence data, and female-specific sequence data. The reference sequences are required to be in FASTA format. The male and female sequence data can be in either FASTQ or FASTA format. Preferably, the male and female sequence data should be from either highly inbred populations, or from a pool of many individuals to adequately sample genetic variation. For the best results, the male and female sequence data should be from the same colony or population to minimize the risk of bacterial or viral contamination exclusive to either the male or female data. The male and female sequence data is aligned to the reference sequences using the ultrafast read aligner Bowtie . CQ-calculate uses stringent alignment criteria requiring the entire read to align with zero mismatches. To account for differences in coverage between the male and female sequence data, the chromosome quotients of the reference sequences are normalized to the median chromosome quotient of known autosomal sequences.
CQ-calculate can run on modest computers, and the only software requirements are: Linux, Perl, BioPerl, and bowtie. The time CQ-calculate takes to run is dependent on the genome size and coverage of the male and female sequence data. On a server running Ubuntu 12.04 with an Intel 3930 K six-core processor, it took approximately five minutes to calculate the chromosome quotients for all the sequences in the An. stephensi genome. For the much larger human genome, CQ-calculate took less than one hour. Memory requirements are dependent on the size of the reference genome but are typically minimal. CQ-calculate is easy to run, and is applicable to any heterogametic genome where separate male and female sequence data are available.
The CQ method uses the number of alignments from male and female sequence data to determine whether a sequence is Y-linked. The number of alignments affects the confidence with which a sequence can be classified as Y-linked. Thresholds were examined from one male alignment to 50 male alignments. Increasing the threshold for male alignments decreases the number of false positive results but also increases the number of false negatives (Additional file 4: Table S1). A threshold of 30 male alignments was chosen as it balances the number of false positives with the total number of sequences that can be analyzed. However, the threshold is flexible and can be increased for higher confidence in Y-linkage or decreased for a lower rate of false negatives.
We noticed that there are sequences with thousands of alignments from male and female data, but that still have many more male alignments than female alignments leading CQ-calculate to identify these sequences as Y-linked. Some of these sequences are known to be located on the autosomes or X, so we hypothesize that these are highly repetitive sequences with copies on the autosomes or X, but have many more copies on the Y. These highly repetitive sequences can be removed by setting a threshold for female alignments. In this study, we set a stringent threshold for female alignments of 30. Thus, to classify a sequence as Y-linked using the CQ method it must have a chromosome quotient less than 0.3, more than 30 alignments from male data, and less than 30 alignments from female data.
The coverage of the next-generation sequence data can affect the calculation of chromosome quotients. Since a threshold of alignments is required to classify a sequence as Y-linked, higher coverage male and female sequence data leads to the identification of more Y-sequences. We have tested a range of coverages and found that chromosome quotients can still be accurately calculated with as low as 5× coverage, but at this low coverage the number of Y sequences identified is reduced, and the false positive rate is increased. Above 10× coverage, many more Y sequences can be identified. With more than 20× coverage, many short Y sequences can be identified with a low rate of false positives.
The CQ method positive control
We tested the CQ method on the genomes of H. sapiens and D. melanogaster. Since repeats can obscure Y sequences, we downloaded the softmasked reference genomes of H. sapiens (hg19 assembly) and D. melanogaster (dm3 assembly) from the USCS Genome Bioinformatics website. In a genome that is softmasked, repetitive sequences identified by RepeatMasker are replaced with lowercase nucleotides. We removed these repeats, fragmenting the genomes into smaller pieces. Fragments shorter than 250 bases were removed to mitigate false positives. The number of sequences resulting from the fragmentation, and the N50 size of the fragments were calculated (Additional file 5: Table S2).
H. sapiens male and female specific next-generation sequence data from a single male and female were downloaded from the 1000 Genomes Project [HG00234 and HG00235] . Male and female D. melanogaster next-generation sequence data for pooled five day old mated adults was downloaded from the NCBI Sequence Read Archive [SRA: SRP007888]. The coverage of the H. sapiens and D. melanogaster data was calculated (Additional file 6: Table S3). Chromosome quotients were calculated for all the fragments of the H. sapiens and D. melanogaster genomes and then normalized to the median chromosome quotient of the known autosomal sequences.
Y gene finding in the Asian malaria mosquito An. Stephensi
454 sequencing was performed on pools of approximately 30 adult An. stephensi from the Indian wild type strain. Newbler 2.6 was used to assemble the An. stephensi 454 sequence data into contigs and scaffolds. The contigs of this assembly are available from the NCBI [GenBank: ALPR00000000]. The number of contigs and N50 contig size were calculated (Additional file 5: Table S2). Pools of approximately 30 male and female An. stephensi of the Indian wild type strain were sequenced using the Illumina Genome Analyzer II [SRA: SRP013838] (Additional file 6: Table S3).
An. stephensi sequences have been anchored to chromosomes using chromosomal in situ hybridization . The short probe sequences from FISH that hybridized to known autosome and X sequences were compared to the An. stephensi genome scaffolds using blastn requiring 95 percent identity. Autosome and X scaffolds were recovered. The scaffolds were then split into fragments by the ambiguous nucleotide N. Fragments shorter than 250 bases were removed. The chromosome quotients of these sequences were calculated.
Chromosome quotients were calculated for all the contigs from the An. stephensi genome using the male and female Illumina data mentioned above and normalized to the median chromosome quotient of autosomal sequences. Y genes were identified in the contigs with chromosome quotients less than 0.3 by comparison to transcriptome sequence data. Using blastn requiring 100 percent identity and an e-value less than 1×10-5 we compared the Y-linked contigs to transcriptome sequence data raw reads from An. stephensi Indian wild type strain. The time points compared were: 0–1 hour embryos, 2–4 hour embryos, 4–8 hour embryos, 8–12 hour embryos, mixed-instar larva, pupa, adult females, and adult males [SRA: SRP013839]. The number of reads for each time point was calculated (Additional file 6: Table S4).
We searched for evidence of the An. stephensi Y chromosome gene sYG2 in the An. gambiae genome using blastn and tblastx. Blastn searches with a word size seven and e-value threshold of 1×10-2 using sYG2 as the query against the An. gambiae PEST, M and S genome assemblies yielded no significant similarity. Additionally blastn searches with word size seven and e-value threshold of 1×10-2 using sYG2 as the query against the PEST, M, and S trace files revealed no significant alignments. Tblastx searches with an e-value threshold of 1×10-5 yielded no significant similarity to the An. gambiae genome.
Y gene finding in the African malaria mosquito An. gambiae
The AgamP3 genome assembly of the An. gambiae PEST strain was downloaded from VectorBase. The genome was divided into seven parts: the arms of the two autosomes, the X chromosome, fragments of the Y chromosome, and unmapped sequences referred to as UNKN. The repetitive sequences of the AgamP3 genome assembly were masked using RepeatMasker by VectorBase and indicated by lowercase nucleotides.
The repetitive sequences indicated by RepeatMasker were removed from the autosomes, X, Y, and UNKN sequences creating many smaller fragments. Sequences shorter than 250 bases were removed to mitigate false positives. The number and N50 size of the fragmented autosome, X, Y, and UNKN sequences were calculated (Additional file 5: Table S2). Illumina sequencing of male and virgin female An. gambiae was performed on pools of approximately 30 individuals from the G3 strain from the same colony [SRA: SRP014730] (Additional file 6: Table S3). Chromosome quotients were calculated for the fragmented An. gambiae autosomes, X, Y, and UNKN sequences using the Illumina sequence data mentioned above.
We were concerned that Sanger sequencing, which was used to sequence the An. gambiae genome, may be biased against the heterochromatic Y chromosome. We attempted to circumvent this bias by assembling our male G3 strain Illumina data and searching for Y sequences. We assembled the An. gambiae male Illumina sequence data using ABySS single-end assembly with the kmer setting 31 . The number of contigs and the N50 contig size of the assembly were calculated (Additional file 5: Table S2). Since the N50 contig size was so short for this assembly, no further fragmentation was deemed necessary. Chromosome quotients were calculated for all the contigs in this assembly using the male and female An. gambiae Illumina data.
We searched for Y genes in the sequences with chromosome quotients less than 0.3 using the raw reads from An. gambiae transcriptome sequence data. Using blastn requiring 100 percent identity and an e-value less than 1×10-5 we compared the Y-linked contigs to transcriptome sequence data from adult male An. gambiae [SRA: SRP014756]. Using the same parameters, we also compared the UNKN sequences that we inferred to be Y-linked by their CQs to the same transcriptome data (Additional file 7: Table S4).
We searched for the An. gambiae Y chromosome genes gYG1 and gYG2 in the An. stephensi genome using blastn and tblastx. Blastn searches with a word size of 7 and an e-value threshold of 1×10-5 using the sequence of gYG1 and gYG2 as a query yielded no similarity to the An. stephensi genome. Furthermore, using blastn we compared gYG1 and gYG2 to Illumina data from An. stephensi and we found no significant similarity using word size seven and an e-value threshold of 1×10-2. Tblastx searches with an e-value threshold of 1×10-5 only yielded significant alignments from a small repetitive part of gYG1 to the An. stephensi genome.
Molecular biology methods
Genomic DNA samples were isolated with Life Technologies DNAzol from male and virgin female mosquitoes from the Indian wild type strain of An. stephensi and the G3 strain of An. gambiae. Five male and female samples were prepared, each from five individuals. In the case of Y genes that have autosomal or X paralogs, primers for genomic DNA PCR were designed with the differences between the autosomal or X paralog and the Y sequence at the 3′ end of the primer. PCR was performed with Finnzymes Phire DNA Polymerase. RNA was isolated from embryos, larva, pupa, adult male, and adult female individuals with the Life Technologies mirVana RNA isolation kit using the total RNA isolation protocol. Complementary DNA was synthesized with Life Technologies SuperScript III RT. All primer sequences are available in the additional files (Additional file 8: Table S5). We used primer-sets that exhibited male-specific amplification to perform RT-PCR with the complementary DNA mentioned above as template with either Finnzymes Phire DNA Polymerase or TaKaRa rTaq. Rapid amplification of cDNA ends (RACE) was performed on sYG1, gYG2, and gYG3 using the SMARTer RACE cDNA Amplification Kit. The resulting sequences were assembled into full-length transcripts, and verified by sequencing complementary DNA. Digital PCR was performed with a QX100 Droplet Digital PCR System from Bio-Rad, on male and female genomic DNA with a probe that would hybridize to both sYG1 and AsA-bbx. A single copy autosomal gene, zeta DNA polymerase catalytic subunit, was used as the reference set as two copies per diploid genome. Chromosomal fluorescence in situ hybridization for sYG1 was performed on mitotic and polytene chromosomes using the method described in [42–44]. Phylogenetic analysis was performed with MrBayes . The alignment and parameters used to infer the phylogeny are provided in the additional files (Additional file 9).