Gene trap mutagenesis of hnRNP A2/B1: a cryptic 3' splice site in the neomycin resistance gene allows continued expression of the disrupted cellular gene

Background Tagged sequence mutagenesis is a process for constructing libraries of sequenced insertion mutations in embryonic stem cells that can be transmitted into the mouse germline. To better predict the functional consequences of gene entrapment on cellular gene expression, the present study characterized the effects of a U3Neo gene trap retrovirus inserted into an intron of the hnRNP A2/B1 gene. The mutation was selected for analysis because it occurred in a highly expressed gene and yet did not produce obvious phenotypes following germline transmission. Results Sequences flanking the integrated gene trap vector in 1B4 cells were used to isolate a full-length cDNA whose predicted amino acid sequence is identical to the human A2 protein at all but one of 341 amino acid residues. hnRNP A2/B1 transcripts extending into the provirus utilize a cryptic 3' splice site located 28 nucleotides downstream of the neomycin phosphotransferase start codon. The inserted Neo sequence and proviral poly(A) site function as a 3' terminal exon that is either utilized to produce hnRNP A2/B1-Neo fusion transcripts or skipped to produce wild-type hnRNP A2/B1 transcripts. This results in only a modest disruption of hnRNP A2/B1 gene expression. Conclusions Expression of the occupied hnRNP A2/B1 gene and utilization of the viral poly(A) site are consistent with an exon definition model of pre-mRNA splicing. These results reveal a mechanism by which U3 gene trap vectors can be expressed without disrupting cellular gene expression, thus suggesting ways to improve these vectors for gene trap mutagenesis.


Background
Gene entrapment has provided effective strategies for insertional mutagenesis of mammalian cells in culture. The mutagens permit direct selection of clones in which cellular genes have been disrupted and simplify the characterisation of genes associated with recessive mutations [1]. Mutagenesis of embryo-derived stem (ES) cells, coupled with in vitro genetic screens, has been widely used to analyse gene functions in mice [1]. These have included screens for mutations in developmentally regulated genes [2][3][4][5], in genes regulated by extracellular agonists [6,7], and in genes encoding secreted and transmembrane proteins [8,9]. Characterized mutations include genes involved in intracellular trafficking [10], transcriptional regulation [11,12], signal transduction [7,8,11,13-15], neural development [16], neural wiring [17], and axial patterning [18,19]. The rapid expansion of the nucleic acid databases has had a tremendous impact on the identification of genes disrupted by gene entrapment. This has led to the development of tagged sequence mutagenesis, a process by which genes disrupted in ES cells are characterized at the nucleotide level prior to germline transmission [20][21][22][23][24].
Gene trap retroviruses developed in our laboratory contain a selectable marker in the U3 region of the long terminal repeat (LTR) of a replication-defective Moloney murine leukemia virus. Selection for U3 gene expression generates clones in which the provirus is positioned in or near exons of actively transcribed genes and is expressed on transcripts originating in the flanking cellular DNA [25]. The vectors appear to be effective mutagens. Single-gene mutation frequencies are 100-1000 fold higher in cells isolated after gene trap selection than in cells containing randomly integrated retroviruses [26]. These targeting frequencies also support the idea that retroviruses can integrate throughout the genome and that most, if not all, expressed genes can be disrupted. Finally, approximately 40% of inserts selected in ES cells result in obvious phenotypes following transmission into the mouse germline [27][28][29]. In the four cases examined, the virus appeared to induce null mutations [10,30-32].
To make the best use of gene traps in genetic studies, it is necessary to understand the factors that allow expression of the entrapment vectors and that determine whether expression of the occupied gene will be disrupted. This is particularly true for tagged sequence mutagenesis, where one would like to predict by sequence alone the effects of the targeting vector on cellular gene expression. For this, a representative number of inserts must be characterised, including those not associated with any discernible phenotype. Most previously analysed mutations were selected because of phenotypes observed after germline transmission, and thus are unlikely to reveal mechanisms that could allow expression of the entrapment vector without disrupting expression of the occupied gene.
The present study characterized a mutation in the 1B4 cell line induced by the U3Neo gene trap retrovirus [29]. This insert was selected for study because the provirus inserted into a widely expressed gene and yet no phenotype was observed in mice homozygous for the provirus. Further analysis revealed that the provirus integrated into an intron of the murine homologue of the human hnRNP A2/B1 gene. The gene encodes two related nuclear ribonucleoproteins, hnRNP A2 and hnRNP B1, members of a large family of RNA binding proteins found associated with mammalian heterogeneous nuclear RNA [33].

Results
The 1B4 cell line was isolated by infecting D3 ES cells with the U3Neo gene trap retrovirus and selecting for G418 resistant clones [29]. 1B4 cells contain a single, intact provirus as assessed by Southern blot hybridization (data not shown). Sequences flanking the virus, isolated by inverse PCR, hybridized to a transcript of approximately 1.8 kb and were used to screen a PCC3 embryonal carcinoma cell cDNA library. A total of 55 positive plaques were identified among 1 × 10^6 plaques screened. Further analysis of ten cDNAs revealed two overlapping clones covering the entire 1.8 kb transcript. The composite cDNA contained an open reading frame encoding a polypeptide of 341 amino acids (Figure 1). Comparison of the translated sequence to the GenBank database using the BLASTP program [34] revealed a significant match with human hnRNP A2/B1 [35]. The human and mouse proteins are identical except that asparagine 287 in the human sequence is replaced by threonine in the mouse sequence (Figure 1). The mouse cDNA sequence has been deposited in GenBank (accession number AF073993).

The 1B4 provirus integrates into the hnRNP A2/B1 gene
To determine where the provirus inserted within the hnRNP A2/B1 gene, genomic DNA flanking the 1B4 provirus was sequenced. The flanking DNA isolated by inverse PCR extended to a HinfI site 226 nucleotides upstream and downstream of the provirus (Figure 2). Sequences matching the cloned cDNA and the human hnRNP A2/B1 cDNA extended 64 nucleotides downstream of the HinfI site, while the remaining 159 nucleotides did not match the cDNA sequence. A consensus 5' splice site was located at the point where the genomic and cDNA sequences diverged. Therefore the 1B4 provirus appeared to integrate 159 nucleotides into an intron of the hnRNP A2/B1 gene. The flanking sequence has been submitted to GenBank (accession number AF073990).
The fact that the flanking genomic DNA hybridized to a single genomic DNA fragment suggested that the provirus inserted into the hnRNP A2/B1 gene and not into a related but uncharacterized gene. However, since the match was based on a relatively short stretch of exon, several experiments were performed to confirm linkage between the provirus and the hnRNP A2/B1 gene. First, two primers complementary to hnRNP A2/B1 sequences located upstream of the provirus were used together with a neo-specific primer in separate PCR reactions. In each case, the size of the amplified product was consistent with insertion of the provirus into the hnRNP A2/B1 gene (data not shown). Second, cDNA sequences predicted to lie downstream of the integration site were used to probe Southern blots. The 3' hnRNP A2/B1 cDNA probe hybridized to an 18 kb EcoRI fragment corresponding to the wild type gene and to a 22 kb fragment in DNA from mice containing the 1B4 provirus (data not shown). The difference in the size of the wild type and mutant alleles resulted from the inserted U3Neo provirus. Finally, as described below, U3Neo transcripts expressed in 1B4 cells are fused to upstream hnRNP A2/B1 sequences.

Figure 1
Sequence of the murine hnRNP A2/B1 gene. The complete nucleotide sequence and predicted amino acid sequence of the murine hnRNP A2/B1 cDNA were determined from two overlapping clones. The predicted amino acid sequence is identical to the predicted human protein with the exception of the substitution of threonine for asparagine at amino acid 287 (shaded). The site of provirus integration in 1B4 cells is indicated. Flanking sequences isolated by inverse PCR that are colinear with the cDNA, and a polyadenylation signal, are underlined.

Provirus integration does not disrupt expression of the hnRNP A2/B1 gene
Of the sixteen U3 gene trap proviruses selected in ES cells that we have introduced into the germline, six resulted in obvious phenotypes (typically embryonic death) when bred to homozygosity [10,21,28-32]. In cases where no obvious phenotype is observed, it is important to determine whether the insert failed to disrupt gene expression or whether the gene is dispensable. Inheritance of the 1B4 provirus followed a Mendelian distribution and no phenotypic changes were observed. Among the 58 offspring analyzed after crossing mice heterozygous for the 1B4 provirus, 13 failed to inherit the provirus, while 32 and 13 were heterozygous and homozygous for the provirus, respectively (Figure 3). A representative Southern blot used to genotype offspring is shown in Figure 3.
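The statement that inheritance followed a Mendelian distribution can be checked with a chi-square goodness-of-fit test against the 1:2:1 ratio expected from a heterozygous intercross. The following is a minimal sketch and was not part of the original analysis; the function name is hypothetical, and the p-value uses the fact that the chi-square survival function with two degrees of freedom reduces to exp(-x/2).

```python
import math


def mendelian_chi_square(wild_type, heterozygous, homozygous):
    """Chi-square goodness-of-fit test against a 1:2:1 ratio (2 degrees of freedom)."""
    observed = [wild_type, heterozygous, homozygous]
    total = sum(observed)
    expected = [total * 0.25, total * 0.5, total * 0.25]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 2 degrees of freedom the chi-square survival function is exp(-x/2).
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Genotype counts reported for the 58 offspring of the 1B4 heterozygous cross.
chi2, p = mendelian_chi_square(13, 32, 13)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```

With the reported counts the statistic is about 0.62 (p of roughly 0.73), so the data give no reason to reject a 1:2:1 segregation ratio.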
To test whether the 1B4 provirus disrupts expression of the hnRNP A2/B1 gene, RNA from wild-type mice and mice homozygous for the 1B4 provirus was analyzed by Northern blot hybridization, using hnRNP A2/B1 cDNA probes derived from sequences upstream and downstream of the integration site. All tissues from wild type mice expressed a single, [...] heterozygotes or wild type embryos. RNA isolated from these MEFs was used to quantify the extent of message reduction in 1B4 homozygous cells by Northern blot hybridization. With probes derived from cDNA sequences downstream of the integration site, Northern blot analysis revealed a 50% reduction of transcripts in mutants compared to wild type, using the glyceraldehyde 3-phosphate dehydrogenase (GAPDH) message as an internal standard (Figure 4).
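Normalization to an internal standard such as GAPDH amounts to dividing each target signal by the reference signal from the same lane before comparing genotypes. The sketch below illustrates that arithmetic only; the function name and the band intensities are hypothetical and are not data from this study.

```python
def relative_expression(target_mutant, reference_mutant,
                        target_wild_type, reference_wild_type):
    """Return the mutant target signal as a fraction of wild type after normalization."""
    if min(reference_mutant, reference_wild_type, target_wild_type) <= 0:
        raise ValueError("reference and wild-type signals must be positive")
    normalized_mutant = target_mutant / reference_mutant
    normalized_wild_type = target_wild_type / reference_wild_type
    return normalized_mutant / normalized_wild_type


# Hypothetical densitometry values in arbitrary units (not measured data).
ratio = relative_expression(target_mutant=450.0, reference_mutant=900.0,
                            target_wild_type=1000.0, reference_wild_type=1000.0)
print(f"{ratio:.2f}")  # 0.50
```

Dividing by the reference first corrects for unequal loading: in this made-up example the mutant lane carries less total RNA, yet the normalized ratio is still one half.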

Splicing and polyadenylation of hnRNP A2/B1-Neo fusion transcripts
Each LTR of the U3Neo provirus contains sequences for 3' processing and polyadenylation. Continued expression of hnRNP A2/B1 transcripts suggests that use of the viral poly(A) sites is less efficient than removal of the intron in which the provirus resides. To determine whether mutation of viral 3' processing signals was responsible for continued hnRNP A2/B1 expression, a 500 base pair region spanning the polyadenylation signal in the 5' LTR was amplified from integrated provirus DNA and sequenced. However, the sequence of the PCR product was identical to the wild-type Moloney murine leukemia virus LTR (data not shown).
The question remained as to how hnRNP A2/B1-neo fusion transcripts are expressed. Previous Northern blot analysis found high levels of a 2.3 kb fusion transcript in 1B4 cells [29], approximately the size expected for hnRNP A2/B1-neo fusion transcripts terminating in the 5' LTR. One possibility is that fusion transcripts may combine the proximal upstream hnRNP A2/B1 exon, 5' splice site, flanking intron and 5' LTR into a single, terminal exon. However, this possibility contradicts current models of exon definition in which exons in pre-mRNA are first defined by proteins interacting across exons and then processed as relatively autonomous units. Alternatively, the proximal hnRNP A2/B1 exon may maintain its autonomy and splice to a cryptic 3' splice site, located either in Neo or in the adjacent intron.
To distinguish between these alternatives, hnRNP A2/B1-Neo fusion transcripts expressed in MEF and ES cells were analyzed by reverse transcriptase PCR (RT-PCR). A primer complementary to the Neo gene (NeoA) was used to prime first strand cDNA synthesis. Primers complementary to adjacent hnRNP A2/B1 exon sequences (PR1 or PR2) and a neo-specific primer (NeoB) were used to amplify transcripts extending from the hnRNP A2/B1 gene into the provirus (Figure 5A). Transcripts extending through the 5' splice site, proximal intron and into the provirus would produce RT-PCR products of 917 and 598 nucleotides with PR1 and PR2, respectively. As shown in Figure 5B, the size of the major PCR product from each reaction was significantly smaller than expected for transcripts colinear with the flanking DNA. Moreover, the major PCR products did not hybridize to a U3-specific oligo probe (Figure 5C). Three independent RT-PCR products were cloned from separate amplification reactions and sequenced. As shown in Figure 5D, all of these transcripts spliced from the proximal 5' splice site in the hnRNP A2/B1 gene to a cryptic 3' splice site located in the Neo gene. Characteristic of 3' splice sites, the Neo splice site contained PyAG and a potential branch point. The U3 probe also detected several minor RT-PCR products upon prolonged exposure (Figure 5C). We were unable to clone these products due to their low abundance. However, they were smaller than expected for transcripts colinear with the flanking DNA and may arise from cryptic splice sites in the flanking intron.

Discussion
Several large scale screens of insertion mutations in mouse embryo-derived stem (ES) cells rely on DNA sequence analysis to select mutations for germline transmission [21-24]. The process, designated "tagged sequence mutagenesis", involves sequencing short segments of DNA isolated from each mutation to identify genes disrupted by the targeting vector. Sequence-based screens are faster and less expensive than phenotype-based screens, and provide centralised collections of characterised mutations available for germline transmission. However, to maximise the utility of tagged sequence mutagenesis, one would like to predict, from the sequence alone, the functional consequences of the inserted targeting vector on cellular gene expression. Toward this end, the present study characterised a mutation generated by insertion of the U3Neo gene trap retrovirus into an intron of the hnRNP A2/B1 gene. Expression of the Neo gene involved splicing of some hnRNP A2/B1 transcripts to a cryptic splice acceptor site located 28 nucleotides downstream of the neomycin phosphotransferase (NPT) initiation codon. Other hnRNP A2/B1 transcripts splice normally, removing the provirus along with other intron sequences. Therefore, expression of the hnRNP A2/B1 gene was reduced only to about half of wild type levels, and the mutation caused no obvious phenotype in mice.
The present study identified a mechanism that allows expression of a U3Neo gene from a provirus positioned within an intron. The majority of hnRNP A2/B1-Neo fusion transcripts utilised a cryptic 3' splice site within the NPT coding sequence. This places the 5' proviral poly(A) site at the end of an alternative exon that can be utilised to produce fusion transcripts, or excluded to produce wild type transcripts. Since the initiation codon for NPT lies upstream of the Neo 3' splice site, NPT is expressed as a fusion protein in which the first 219 amino acids of hnRNP A2/B1 are appended to codon 10 of NPT.
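The reading-frame arithmetic behind the fusion protein can be made explicit. The sketch below is illustrative only and rests on an assumption not stated in the text: that the 28-nucleotide offset is counted with the A of the NPT ATG as position 1, so that nucleotide 28 is the first Neo nucleotide retained after splicing. Under that assumption the retained sequence begins at the first base of codon 10, which agrees with the junction described above. The function names are hypothetical.

```python
def codon_position(nucleotide):
    """Map a 1-based coding-sequence position to (codon number, base within codon)."""
    if nucleotide < 1:
        raise ValueError("position must be 1 or greater")
    offset = nucleotide - 1
    return offset // 3 + 1, offset % 3 + 1


def fusion_in_frame(upstream_coding_nt, downstream_first_nt):
    """True if joining the upstream coding sequence to the downstream position keeps the frame."""
    # The upstream exon must supply exactly the bases that the downstream
    # codon is missing, i.e. both sides must agree modulo 3.
    return upstream_coding_nt % 3 == (downstream_first_nt - 1) % 3


# Assumption: position 1 is the A of the NPT initiation codon.
print(codon_position(28))            # (10, 1)

# 219 upstream codons correspond to 657 coding nucleotides.
print(fusion_in_frame(219 * 3, 28))  # True
print(fusion_in_frame(658, 28))      # False: one extra base shifts the frame
```

Only upstream exons that end in the matching phase give an in-frame fusion, which is why use of this acceptor site could bias which genes are recovered.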
Because of these results, we have analyzed the expression of 6 other U3Neo proviruses located in the introns of different genes. Transcripts in all but one clone splice to the cryptic Neo splice site, while the one exception utilizes a cryptic splice site located in the proximal intron (E. White, G. Hicks, M. Roshon and H. E. Ruley, in preparation). Thus, utilization of the Neo cryptic site appears to provide the predominant mechanism by which the U3Neo gene is expressed following insertion into introns.
These results are consistent with an exon definition model in which splicing and polyadenylation require interactions between factors acting across exons [40,41]. This model predicts that polyadenylation signals are not recognised unless they can be defined as part of a 3' terminal exon. Accordingly, poly(A) sites are not efficiently recognised when positioned between 5' and 3' splice sites [38,39], and insertion of a 5' splice site into a 3' terminal exon suppresses polyadenylation [40,42]. Conversely, upstream 3' splice sites can enhance polyadenylation [43][44][45]. We find that the proximal hnRNP A2/B1 exon upstream of the provirus does not lose its identity; rather, the exon splices either to the Neo splice site or to the next hnRNP A2/B1 exon. Moreover, the poly(A) site in the 5' LTR appears to be used exclusively in conjunction with a cryptic 3' splice site.
Utilization of the Neo 3' splice site is likely to have two important consequences with regard to the use of U3Neo vectors for insertional mutagenesis. First, insertion into an intron may not disrupt cellular gene expression. In the present study, levels of hnRNP A2/B1 transcripts were reduced only about twofold in homozygous mutant cells, and the relative amount of the A2/B1 protein in hnRNP complexes was unaffected (G. Dreyfuss, personal communication). This may explain the absence of an obvious phenotype in mice. Alternatively, other hnRNP proteins may compensate for reduced levels of the A2/B1 transcripts, just as cells can tolerate severe reductions in the levels of the related hnRNP A1 protein [46,47].
Second, since the Neo 3' splice site is downstream of the NPT initiation codon, its use may skew the targeting in favor of those genes capable of splicing upstream exons in-frame to produce enzymatically active fusion proteins. The magnitude of the potential bias is difficult to assess. A variety of amino-terminal fusions maintain enzymatic activity (or produce enzymatically active breakdown products), including those fused to codon 12 of NPT [48][49][50][51][52]. Moreover, selection of resistant cell clones requires only minimal levels of Neo gene expression [53]. Still, genes providing the appropriate introns are expected to provide larger targets for gene trap mutagenesis than genes lacking such introns. This could contribute to the fact that 3 of 400 inserts characterised in an earlier study occurred in the same intron of the L29 gene [21].
The Neo 3' splice site contains a potential branch point sequence, and the sequence (CAGG) across the intron-exon boundary is optimal according to the scanning model of 3' splice site selection [54,55]. The Neo site differs from the typical 3' splice site in that it lacks a polypyrimidine tract; however, this feature is often missing from alternative splice sites [56]. Both the A of the branch point and the intron-terminal AG dinucleotide are considered invariant; therefore, one may be able to enhance the mutagenic efficiency of U3Neo vectors by altering these nucleotides. Alternatively, the problem may be avoided by using other selectable markers, assuming their sequences lack cryptic splice sites, or by using gene trap vectors that rely on splicing to activate the expression of genes carried by the targeting vector. The latter vectors contain strong splice sites, either in front of [2,4] or behind [23] the entrapment cassette, allowing efficient expression from within introns.
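Screening a candidate selectable marker for motifs of this kind can start with a simple sequence scan. The sketch below is a naive motif search, not an implementation of the scanning model: it reports positions that follow a pyrimidine-A-G trinucleotide, optionally requiring a G as the first exon base (the CAGG-like context), and it ignores branch points and polypyrimidine tracts. The function name and the toy sequence are hypothetical; the sequence is not taken from the Neo gene.

```python
import re


def candidate_acceptor_sites(sequence, require_exon_g=False):
    """Return 0-based indices of the first base downstream of each PyAG motif."""
    sequence = sequence.upper()
    # A lookahead pattern lets overlapping motifs be reported individually.
    pattern = r"(?=[CT]AGG)" if require_exon_g else r"(?=[CT]AG)"
    return [match.start() + 3 for match in re.finditer(pattern, sequence)]


# Made-up sequence for illustration only.
toy = "GGCTAACTTTCAGGATCGTAGC"
print(candidate_acceptor_sites(toy))                       # [13, 21]
print(candidate_acceptor_sites(toy, require_exon_g=True))  # [13]
```

A scan like this will flag many sites that are never used in vivo, so hits are best treated as candidates for mutation or experimental testing rather than as predictions.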

Conclusions
hnRNP A2 and hnRNP B1 are members of a large family of RNA binding proteins found associated with mammalian heterogeneous nuclear RNA. The proteins are thought to participate in the processing of mRNA precursors [33], and they can influence splice site selection and promote exon skipping in vitro [57][58][59]. Consistent with a fundamental role in RNA metabolism, the human and mouse hnRNP A2 sequences are highly conserved, with only one amino acid difference out of 341 residues. However, since the 1B4 mutation did not ablate hnRNP A2/B1 gene expression, it is unlikely to be useful for studies of hnRNP A2/B1 gene function. While further analysis might uncover phenotypes associated with this hypomorphic mutation, detailed examination of either cells or mice seemed unjustified in the absence of any greater effect on gene expression. Our results illustrate the interplay between polyadenylation and splicing as predicted by an exon definition model. Moreover, the 1B4 mutation reveals a mechanism by which U3 gene trap vectors can be expressed without disrupting cellular gene expression and suggests ways to improve the vectors for gene trap mutagenesis.

Methods
Isolation of cDNA clones encoding the hnRNP A2/B1 protein
DNA sequences (260 nt) adjacent to the 1B4 provirus were isolated by inverse polymerase chain reaction (PCR) as reported elsewhere [29]. This flanking sequence was used as a probe to genotype mutant mice and cells by Southern blot hybridization and was also used to isolate cDNA clones encoding the murine hnRNP A2/B1 protein from a PCC3 embryonal carcinoma cell cDNA library. Fifty-five hybridizing plaques were identified from a total of 1 × 10^6 plaques screened. Initial characterization of 10 strongly hybridizing plaques identified two overlapping clones corresponding to the full-length transcript.

Sequencing
cDNA templates were subcloned into the pBluescript KS plasmid and completely sequenced on both strands. Plasmid DNA was isolated by the boiling lysis method [60], followed by precipitation with polyethylene glycol 8000. Each sequencing reaction used 5 µg of plasmid DNA [10,61]. Initial sequences were determined by using T3 and T7 primers, and extended by using custom 17-18 nt primers (Gibco-BRL).

Reverse transcriptase PCR
RT-PCR was performed as described [62]. 20 µg RNA was treated with 1 unit of RNAse free DNase

Authors' contributions
JD introduced the 1B4 mutation into the germline; MR analyzed the 1B4 mutation in mice and cells and characterized the hnRNP A2/B1 cDNA; and ER supervised the project.