A comparative genomics approach was used to generate sequence-specific information for a high-producing CHO cell line, with the goal of making these data publicly available. The development of CHO genomic resources will benefit not only cell line engineering efforts to enhance biopharmaceutical production but also other areas of research utilizing CHO cells, such as the use of radiation hybrid mapping for comparative genomic analysis. The analysis presented here demonstrates the potential of applying Illumina sequencing in the development of CHO genomic resources. Integrating genomic sequences derived from multiple next-generation technologies, such as 454 and Illumina sequencing, with those derived from Sanger sequencing enhances genomic coverage. The inclusion of long paired-end or mate-paired libraries with varied insert sizes, coupled with the high throughput of next-generation sequencing technologies, should also provide both the sequence and the structural information required for de novo assembly of the CHO genome.
Neither the short reads nor the reference genomes were repeat-masked prior to alignment. A prevalent feature of mammalian genomes is the high content of repetitive sequences. Approximately 46% of the human genome, 37% of the mouse genome, and 40% of the rat genome are repetitive sequences [27, 28]. Repeat-masking either the short reads or reference genome would discard information about a significant fraction of the genome and would reduce coverage in an uneven manner. Recent work suggests that endogenous repetitive structures on CHO chromosomes may promote gene amplification and increase the stability of the amplified gene. Including repetitive regions in the assembly and analysis may help identify genomic structures associated with hyperproductive CHO cell lines.
Several studies employed a similar approach to successfully generate genomic resources for non-model organisms from low-coverage data [25, 34, 35]. Reference-guided alignment and analysis nevertheless have inherent limitations related to sequence similarity and genomic structure. MAQ allows up to 2 mismatches within the first 28 bp of each read and does not allow for gaps in the alignment. Short reads derived from regions with less than 94% identity to the reference sequence may therefore not be aligned. This may account for the low percentage of total CHO reads aligned to either the mouse or rat genome, and it suggests that the CHO contigs presented here represent regions that are highly conserved between CHO cells and mouse or rat. In an initial genomic sequencing of the turkey using Illumina technology, only one-third of the short 35 bp reads could be directly aligned to the genome of the chicken, a closely related species, suggesting that a large portion of the short reads may not be expected to align in this type of analysis.
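The mismatch constraint described above can be illustrated with a minimal sketch. This is not MAQ's implementation. It only checks an ungapped, same-length comparison against the rule of at most 2 mismatches in the first 28 bp and reports percent identity. The function names, the `mutate` helper, the toy reference sequence, and the 35 bp read length are assumptions chosen for illustration.

```python
# Illustrative only: a toy check of an ungapped seed-mismatch rule.
# All names here are hypothetical and are not part of MAQ.

def count_mismatches(read, ref_window, length=None):
    """Count mismatches between a read and an equally long reference
    window, with no gaps, over the first `length` bases (default: all)."""
    if len(read) != len(ref_window):
        raise ValueError("ungapped comparison needs equal-length sequences")
    n = len(read) if length is None else min(length, len(read))
    return sum(1 for a, b in zip(read[:n], ref_window[:n]) if a != b)


def passes_seed_rule(read, ref_window, seed_len=28, max_seed_mismatches=2):
    """True if the first `seed_len` bases have at most
    `max_seed_mismatches` mismatches (thresholds taken from the text)."""
    return count_mismatches(read, ref_window, seed_len) <= max_seed_mismatches


def percent_identity(read, ref_window):
    """Percent of positions that match over the full read length."""
    matches = len(read) - count_mismatches(read, ref_window)
    return 100.0 * matches / len(read)


def mutate(seq, positions):
    """Hypothetical helper: substitute a different base at each position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    bases = list(seq)
    for p in positions:
        bases[p] = swap[bases[p]]
    return "".join(bases)


if __name__ == "__main__":
    ref = ("ACGT" * 9)[:35]              # toy 35 bp reference window
    for sites in ([0, 5], [0, 5, 10], [30, 31, 32]):
        read = mutate(ref, sites)
        print(sites, passes_seed_rule(read, ref),
              round(percent_identity(read, ref), 1))
```

Under these assumptions, a 35 bp read with two substitutions has 33/35, or about 94.3%, identity, which is consistent with the roughly 94% threshold quoted above. A third substitution inside the 28 bp seed causes the read to fail the rule.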
Additionally, during alignment, the sequenced genome is scaffolded onto the reference, so the structure of the final consensus sequence may not be representative of the true genomic architecture. New methodologies are being developed to improve the consensus genomic sequences produced by reference-guided alignment [37, 38]. CHO cell lines commonly used in biopharmaceutical production have a reduced chromosome number compared to primary Chinese hamster cells. These cell lines also undergo genomic rearrangements as a result of the amplification procedures used to develop high-producing cell lines [13, 14]. Therefore, the genomic structure of the Chinese hamster may not be representative of the individual cell lines, and analysis of specific CHO cell lines may provide a better understanding of the structural changes associated with hyperproductivity.
Of particular interest in CHO cell lines is the relationship between the location of the amplified gene and the productivity of the cell line. BAC libraries were recently used to examine the site and structure of the transgene vector in gene-amplified cell lines [3, 4]. The DHFR amplicon is large, up to several hundred thousand nucleotides, and may contain repeated segments of the endogenous CHO genome [3, 4, 39]. The short length of the CHO contigs makes it unlikely that any contig will span both the DHFR amplicon and the host genome. Additionally, the transgene vector sequence is not present in the reference genome used during alignment. Together, these factors make it difficult to determine the integration site of the DHFR transgene vector in this analysis. Greater coverage of the CHO genome, sufficient to permit de novo assembly of the reads, will facilitate determining the integration site and copy number of the DHFR amplicon in this cell line. Increased coverage and refinement of the CHO genome will also enable detection of other copy number variants, such as insertions and deletions, and accurate SNP identification to assist cell line engineering efforts [40–42].