Evaluation of the chicken transcriptome by SAGE of B cells and the DT40 cell line
- Matthias B Wahl†1, 4,
- Randolph B Caldwell†1,
- Andrzej M Kierzek2, 4, 5,
- Hiroshi Arakawa1,
- Eduardo Eyras3,
- Nina Hubner1,
- Christian Jung1,
- Manuel Soeldenwagner1,
- Manuela Cervelli1,
- Yan-Dong Wang1,
- Volkmar Liebscher3 and
- Jean-Marie Buerstedde1
© Wahl et al; licensee BioMed Central Ltd. 2004
Received: 28 October 2004
Accepted: 21 December 2004
Published: 21 December 2004
The understanding of whole genome sequences in higher eukaryotes depends to a large degree on the reliable definition of transcription units including exon/intron structures, translated open reading frames (ORFs) and flanking untranslated regions. The best currently available chicken transcript catalog is the Ensembl build based on the mappings of a relatively small number of full length cDNAs and ESTs to the genome as well as genome sequence derived in silico gene predictions.
We use Long Serial Analysis of Gene Expression (LongSAGE) in bursal lymphocytes and the DT40 cell line to verify the quality and completeness of the annotated transcripts. 53.6% of the more than 38,000 unique SAGE tags (unitags) match to full length bursal cDNAs, the Ensembl transcript build or the genome sequence. The majority of all matching unitags show single matches to the genome, but no matches to the genome derived Ensembl transcript build. Nevertheless, most of these tags map close to the 3' boundaries of annotated Ensembl transcripts.
These results suggest that rather few genes are missing in the current Ensembl chicken transcript build, but that the 3' ends of many transcripts may not have been accurately predicted. The tags with no match in the transcript sequences can now be used to improve gene predictions, pinpoint the genomic location of entirely missed transcripts and optimize the accuracy of gene finder software.
The definition of transcription units within a finished genome sequence in higher eukaryotes is challenging and relies on genome mapping of cDNAs and ESTs backed up by theoretical gene finder algorithms. An increasing number of gene sequences from model organisms have made the prediction of well conserved ORFs easier, but less conserved coding and untranslated regions are difficult to detect. The best way to unambiguously define transcription units is full length cDNAs, but large scale projects are expensive in terms of labor and costs. One also needs to bear in mind that some cDNAs elude detection, because unusual secondary structure or toxicity inhibits reverse transcription or cloning.
SAGE investigates the transcription profile of a given cell sample by large scale sequencing of short cDNA tags derived from the bulk mRNA [1, 2]. Whereas tag mapping to the cDNA and genome databases of the organism indicates the type of expressed genes, the prevalence of individual tags within the library reflects their relative levels of expression. Since SAGE tags are only short sequences, they can be collected more easily in higher numbers than ESTs and full length cDNA sequences. The potential of SAGE to discover new transcription units or better define already known ones is particularly advantageous in situations where the entire genome sequence of an organism has been determined, but gene predictions based on theoretical algorithms and the mapping of a relatively small number of EST and cDNA sequences remain tentative. LongSAGE generates longer tags of 21 bases as compared to the classical SAGE protocol and is therefore better suited for the unambiguous assignment of tags to genome sequences [3, 4].
Cellular and molecular features of early B cell development and lymphoma formation [6, 7] have been extensively studied in the chicken. Gene expression signatures of primary bursal B cells, pre-neoplastic and neoplastic lymphoma cells were collected by microarray hybridizations in a first attempt to identify genes up- or down-regulated during myc-induced B cell lymphoma development.
The whole chicken genome including a genome scale transcript build from Ensembl and a collection of bursal full-length cDNAs have recently been released. We describe here the mapping of large collections of SAGE tags from bursal lymphocytes and DT40 to these reference datasets to evaluate the quality of the transcript build. Furthermore, the transcription profiles of bursal cells and DT40 as defined by this first SAGE analysis in the chicken should lead to a better understanding of B cell transformation and facilitate the selection of candidate genes for disruption in DT40.
Results and Discussion
Generation of SAGE tag libraries and SAGE tags collections
Table. SAGE and unitag collections (surviving labels: average frequency of matching SAGE tags within library; average count of SAGE tags per unitag)
Tag to gene assignment using bursal cDNAs, the Ensembl transcript build and the genome sequence
Successful mapping of SAGE tags to reference sequences is influenced by the quality of the sequences, the complexity of the reference sequence datasets and the prevalence of polymorphisms within the tag sequences. It was therefore decided to first search for matches within a bursal cDNA collection which represents the best possible reference dataset, as it was derived from the same tissue and genetic background as the busage library. Subsequently, unitags were mapped to the Ensembl transcript build and finally the chicken genome sequence. Unitags matched in one dataset were excluded from the searches of the subsequent datasets. To facilitate the searches, candidate tags starting with the CATG tetra-nucleotide were extracted from each reference dataset prior to analysis.
Table. Unitag mapping to reference datasets (surviving labels: dataset matches of unitag; average count of SAGE tags per unitag; Ensembl transcript build)
SAGE tags are expected at the position of the NlaIII site closest to the polyA tail of the transcript, but alternative transcript processing as well as incomplete NlaIII digestion or internal priming can produce upstream tags. Indeed, when the positions of the matching candidate tags were analyzed for bursal cDNA transcripts, about 40% of the tags matched to non-last positions (data not shown).
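The distinction between a tag at the 3'-most NlaIII site and an upstream tag can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, the transcript is assumed to be given in sense orientation without its polyA tail, and only CATG sites followed by a full 21-base window are counted as candidate positions.

```python
def candidate_positions(transcript, tag_len=21):
    """Start offsets of all CATG sites that leave room for a full-length tag."""
    positions = []
    start = transcript.find("CATG")
    while start != -1:
        if start + tag_len <= len(transcript):
            positions.append(start)
        start = transcript.find("CATG", start + 1)
    return positions


def classify_tag(transcript, tag):
    """Return 'last', 'upstream' or 'no match' for a tag against one transcript.

    'last' means the tag sits at the CATG site closest to the 3' end
    (assumption: the sequence is in sense orientation, polyA tail removed).
    """
    positions = candidate_positions(transcript, len(tag))
    hits = [p for p in positions if transcript[p:p + len(tag)] == tag]
    if not hits:
        return "no match"
    return "last" if positions[-1] in hits else "upstream"


if __name__ == "__main__":
    # Toy transcript with two NlaIII sites (illustrative sequence only).
    toy = "AAACATG" + "A" * 17 + "GGGCATG" + "C" * 17 + "TT"
    print(classify_tag(toy, "CATG" + "C" * 17))
    print(classify_tag(toy, "CATG" + "A" * 17))
```

Counting the share of 'upstream' results over all matching tags would give a figure comparable to the roughly 40% of non-last positions reported above.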
Mapping of tags to the genome
Locations of unitags having a single match in genome but no transcript match
Bases searched next to annotated Ensembl transcripts
Matches only downstream of Ensembl transcripts
Matches only upstream of Ensembl transcripts
Matches upstream and downstream of Ensembl transcripts
Within Ensembl transcript boundaries
Outside Ensembl transcript boundaries
Relationship of genome mapping unitags to Ensembl transcripts
Analysis of unitags mapping 5' of or within Ensembl transcript boundaries.
BLAST result
Supporting bursal EST
Unitag relationship to Ensembl transcript
Unitags mapping 5'
Upstream 5' exon
PEF protein with a long N-terminal hydrophobic domain
Upstream 5' exon (EST supports two additional 5' exons)
Upstream 5' exon
5' upstream/Exon1 (EST supports one additional 5' exon)
5' upstream/Exon1 (EST supports one additional 5' exon)
Unitags mapping within transcript boundaries
Aldo-keto reductase family 1 member
Protein kinase C, beta type
Centromeric protein E
T-cell activation leucine repeat-rich protein
Bcl-2-associated transcription factor
Unitag mapping to transcripts
Match to annotated transcript
Match to genome within boundaries of annotated transcript
Match next to annotated transcript using 5000 base cut-off
Match distant from annotated transcript
With only multiple genome matches
With match to annotated transcripts or single genome match
Significant gene expression differences between bursal cells and the DT40
List of genes differentially expressed in bursal cells and DT40
Best BLAST result
(AAH61765) Hypothetical protein
(AAH69219) Cold inducible RNA-binding protein
(Q7ZUR6) Similar to muscle-specific beta 1 integrin binding protein
(Q90YW7) Ribosomal protein L4
(Q9YGQ1) Peptide elongation factor 1-beta
(CAA31409) Chinese hamster asparagine synthetase
(P13796) L-plastin (Lymphocyte cytosolic protein 1)
(Q8BGQ8) Heterogeneous nuclear ribonucleoprotein K
(AAH46152) Selenoprotein P precursor
(Q96CJ1) Testosterone regulated apoptosis inducer and tumor suppressor
(P30281) G1/S-specific cyclin D3
(Q8JHJ4) TNF family B cell activation factor
(Q90YB0) FEN-1 nuclease
(Q13200) 26S proteasome non-ATPase regulatory subunit 2
(Q90W60) XNop56 protein
(P22794) Ecotropic viral integration site 2A protein
(Q91XC8) Similar to death-associated protein
(Q99P44) Leucine aminopeptidase
(P97440) Histone RNA hairpin-binding protein
(P34022) Ran-specific GTPase-activating protein
(Q9UMR2) ATP-dependent RNA helicase DDX19
(Q9YGQ1) Peptide elongation factor 1-beta
(AAQ20009) Heterogeneous nuclear ribonucleoprotein H1-like protein
(Q9YGQ1) Peptide elongation factor 1-beta
(Q9H165) B-cell lymphoma/leukemia 11A
The mapping of the SAGE tags to the recently released cDNA collections and the chicken genome has been useful to assess the completeness and accuracy of the current transcript catalog. On the positive side, it appears that the transcript build may have missed only a low percentage of genes, since relatively few tags map to genome regions far away from annotated transcription units. On the downside, fewer than 6,000 of over 19,000 tags with matches to reference sequences could be mapped to transcripts. The majority of the tags missed in transcripts are positioned downstream of annotated transcripts with a minority mapping upstream or within the genomic boundaries of transcripts. The most straightforward explanation for this is that many transcripts in the current version of the chicken transcriptome do not accurately reflect the 3' and the 5' ends of transcripts. This proposition is independently supported by the comparisons of the bursal full length cDNAs to the Ensembl transcript build which detected discrepancies to Ensembl annotated transcripts for approximately 50% of the cDNAs. Another explanation for at least part of the missing transcript matches is variability in poly-adenylation and splicing, which seems to account for substantial variety in the human transcriptome.
Accurate definition of the transcribed parts of the chicken genome is highly desirable not only to ascertain the correct ORFs, but also to identify transcriptional and translational control sequences often located in 5' and 3' untranslated regions. It should be interesting to use the genomic positions of the missed transcript tags in combination with current gene finder algorithms to improve transcript coverage. Many of the missed tags are close to already annotated exons, which facilitates this task. It should also be possible to use promising tag sequences to screen cDNA libraries for clones whose sequence will identify missed genes or exons. The riken1 bursal cDNA library is of excellent quality and should be suitable for this purpose.
Although the presented SAGE data provide valuable information about the expression levels of many genes in bursal cells and the DT40 cell line, the full potential of SAGE for gene expression profiling could not be exploited due to the difficulties in tag to gene assignment. Nevertheless, this first SAGE analysis in the chicken lays the basis for further studies. SAGE has the advantage that data from different experiments and laboratories are easily comparable as the tag sequences serve as a common standard. Accumulation of additional data will increasingly facilitate the interpretation of results, because bona fide tags will be distinguished from artifacts by being replicated, and even polymorphic tags will eventually be defined and assigned to their corresponding transcripts.
LongSAGE library construction
Total RNA from bursal tissue of 20-day-old CB-inbred chicks and from DT40 Cre1 cells was extracted using TRIzol reagent (Invitrogen) according to the manufacturer's instructions. PolyA RNA was isolated using the mRNA DIRECT kit from Dynal http://www.dynal.no. The RNA bound to oligo(dT)25 magnetic beads was immediately used for the construction of a LongSAGE library [1, 3] following a modified protocol as described previously. High fidelity PfuUltra (Stratagene) polymerase was used for the PCR amplification step. The SAGE libraries from bursal tissue and DT40 were named busage and dt40sage, respectively. For each library, distinct Linker/Primer combinations were used to exclude accidental amplification of ditags from the other library.
Sequencing of SAGE library clone inserts
The pZero-1 (Invitrogen) plasmids containing SAGE ditags as multimeric inserts were transformed into E. coli. Zeocin resistant colonies transformed by the plasmids were grown at low density on agar plates, picked and directly suspended in 50 microliters of H2O. This suspension was heated at 95°C for 10 minutes and stored at -20°C until further processing. The PCR amplification used primers from the plasmid backbone, M13 forward and reverse. Sequencing was performed using the Big Dye v3.1 ready reaction mix (Applied Biosystems) and a nested primer (SSP2) from the plasmid poly-linker. Reactions were analyzed on an ABI 3730 DNA Analyzer (Applied Biosystems). The raw sequencing files were processed as described previously.
Ditag, tag and unitag definition
The library insert sequences were searched for ditags in which the flanking CATG tetra-nucleotides are separated by a spacer sequence of more than 31 and less than 37 bases. Ditags of identical sequence were entered only once for each library to avoid the possibility of entering PCR amplification artifacts. The ditags were then divided into two SAGE tags of 21 bases including the CATG tetra-nucleotides. The combined SAGE tag collections of both libraries were normalized to generate a collection of unitags possessing unique tag sequences. A low number of tags (197 of 129,568 total tags) were found to be identical to the sequences of the linker tags used for the library construction and therefore were removed. Care was taken to minimize the possibility of tag sequence errors by using a high fidelity polymerase for the PCR amplification step of the library construction and by rejecting any ditag sequences which contained even a single ambiguous base call or a PHRED score lower than 10. It is possible that some unitags are due to sequencing errors, but these artificial tags are unlikely to match transcript or genome sequences.
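The ditag and tag definitions above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions rather than the pipeline used for the study: all names are hypothetical, adjacent ditags in a concatemer are assumed to share their flanking CATG, the spacer is assumed not to contain an internal CATG, and PHRED filtering and linker-tag removal are left out. Requiring only A, C, G and T in the spacer drops ditags with ambiguous base calls.

```python
import re
from collections import Counter

# CATG, a spacer of 32 to 36 unambiguous bases without an internal CATG, CATG.
# The lookahead allows neighbouring ditags to share a flanking CATG.
DITAG = re.compile(r"(?=(CATG(?:(?!CATG)[ACGT]){32,36}CATG))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def extract_ditags(insert_seq):
    """All ditags found in one sequenced clone insert."""
    return [m.group(1) for m in DITAG.finditer(insert_seq.upper())]


def split_ditag(ditag, tag_len=21):
    """Split a ditag into two tags of tag_len bases, each starting with CATG."""
    return ditag[:tag_len], reverse_complement(ditag[-tag_len:])


def count_tags(insert_seqs):
    """Tag counts for one library; identical ditags are entered only once."""
    unique_ditags = set()
    for seq in insert_seqs:
        unique_ditags.update(extract_ditags(seq))
    counts = Counter()
    for ditag in unique_ditags:
        counts.update(split_ditag(ditag))
    return counts


if __name__ == "__main__":
    left = "CATG" + "A" * 17
    right = "CATG" + "C" * 17
    insert = "TT" + left + reverse_complement(right) + "TT"
    library = count_tags([insert, insert])  # duplicate ditag is collapsed
    unitags = set(library)                  # unique tag sequences
    print(library, len(unitags))
```

The unitag collection is then simply the set of distinct tag sequences over both libraries, while the per-library counts carry the expression information.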
To map the unitags to reference sequences, candidate tags were extracted from i) full length bursal cDNA sequences, ii) the Ensembl transcript build ftp://ftp.ensembl.org/pub/current_chicken/data/fasta/cdna/ and iii) the chicken chromosome sequences ftp://ftp.ensembl.org/pub/current_chicken/data/fasta/dna/. Candidate tags in the transcript datasets were extracted only in the sense orientation whereas both strands of the chromosome sequences were searched. The SAGE tags, unitags and candidate tags together with relevant information concerning their positions and frequencies were entered into tables of a relational database to facilitate further analysis.
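Candidate tag extraction as described here might look like the sketch below. It is an assumption-laden illustration: the function names and the in-memory dictionary index are hypothetical stand-ins for the relational database tables, and reading the FASTA files is left out. Transcripts are scanned in sense orientation only, chromosomes on both strands, and positions on the reverse strand are reported as the leftmost forward-strand coordinate of the tag.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def candidate_tags(seq_id, sequence, both_strands=False, tag_len=21):
    """Yield (tag, (seq_id, forward_start, strand)) for every CATG-anchored window."""
    sequence = sequence.upper()
    strands = [("+", sequence)]
    if both_strands:
        strands.append(("-", reverse_complement(sequence)))
    for strand, s in strands:
        start = s.find("CATG")
        while start != -1 and start + tag_len <= len(s):
            tag = s[start:start + tag_len]
            if set(tag) <= set("ACGT"):  # skip windows with ambiguous bases
                pos = start if strand == "+" else len(s) - start - tag_len
                yield tag, (seq_id, pos, strand)
            start = s.find("CATG", start + 1)


def build_index(records, both_strands=False):
    """Map each candidate tag to the list of places where it occurs."""
    index = defaultdict(list)
    for seq_id, sequence in records:
        for tag, hit in candidate_tags(seq_id, sequence, both_strands):
            index[tag].append(hit)
    return index


if __name__ == "__main__":
    transcripts = [("cdna1", "GGCATG" + "A" * 17 + "GG")]
    chromosomes = [("chr_toy", "T" * 17 + "CATGCC")]
    print(dict(build_index(transcripts)))
    print(dict(build_index(chromosomes, both_strands=True)))
```

Keeping the list of hits per tag makes it possible to separate unitags with a single genome match from those with multiple matches later on.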
Unitag matches were searched for sequentially in the bursal cDNA collection, the Ensembl transcript build and the genome. Only matches of identical sequence were accepted, and once a unitag had been matched it was excluded from the remaining searches. To relate the position of matching unitags in the genome sequence to the Ensembl transcripts, the chromosome coordinates of the Ensembl transcripts and their orientation were extracted from their headers. The database table structure, all tabulated entries and the FOUNTAIN software used for the analysis are freely available for download at http://pheasant.gsf.de/SAGE/download/ and http://pheasant.gsf.de/DEPARTMENT/FOUNTAIN.html.
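The sequential search and the positional comparison with annotated transcripts can be illustrated with the following sketch. It is not the FOUNTAIN-based analysis itself: function names, the dictionary-based indexes and the coordinate conventions are assumptions made for the example, the strand of the tag is ignored, and the 5000-base cut-off is taken from the results above.

```python
def map_sequentially(unitags, datasets):
    """Assign each unitag to the first dataset with an exact match.

    datasets is an ordered list of (name, index) pairs, where index maps a
    tag sequence to a list of hits. Unmatched unitags are assigned None.
    """
    assignment = {}
    for tag in unitags:
        assignment[tag] = None
        for name, index in datasets:
            hits = index.get(tag)
            if hits:
                assignment[tag] = (name, hits)
                break  # matched tags drop out of the remaining searches
    return assignment


def relate_to_transcript(pos, tx_start, tx_end, tx_strand, cutoff=5000):
    """Classify a genome position relative to one annotated transcript.

    Coordinates are forward-strand chromosome positions with
    tx_start <= tx_end; 'downstream' means beyond the 3' end of the
    transcript, taking its orientation into account.
    """
    if tx_start <= pos <= tx_end:
        return "within"
    if pos > tx_end:
        distance = pos - tx_end
        side = "downstream" if tx_strand == "+" else "upstream"
    else:
        distance = tx_start - pos
        side = "upstream" if tx_strand == "+" else "downstream"
    return side if distance <= cutoff else "distant"


if __name__ == "__main__":
    tag_a, tag_b, tag_c = "CATG" + "A" * 17, "CATG" + "C" * 17, "CATG" + "G" * 17
    datasets = [
        ("bursal cDNA", {tag_a: [("cdna1", 100, "+")]}),
        ("Ensembl transcripts", {tag_a: [("tx1", 50, "+")]}),
        ("genome", {tag_b: [("chr1", 12500, "+")]}),
    ]
    print(map_sequentially([tag_a, tag_b, tag_c], datasets))
    print(relate_to_transcript(12500, 1000, 10000, "+"))
```

Single genome matches classified this way fall into the categories used in the results: within transcript boundaries, next to a transcript on its downstream or upstream side, or distant from any annotated transcript.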
Calculation of the significance of SAGE count differences
cDNA was synthesized from bursal tissue and DT40 Cre1 cell line using the SuperScript Preamplification System (Invitrogen). Primers were designed to amplify a region of a few hundred base pairs encompassing the SAGE unitag sequence of the reference transcript. PCR amplification was performed using the Expand Long Template PCR System (Roche) under the following conditions: 2 min initial incubation at 93°C; 20, 25, 30 and 35 cycles consisting of 10 sec at 93°C, 30 sec at 65°C and 5 min at 68°C with 20 sec elongation per cycle.
This work was supported by the EU grants 'Genetics in a cell line' and 'Mechanisms of gene integration'. We would like to thank Kenji Imai for stimulating discussion.
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487.
- Madden SL, Wang CJ, Landes G: Serial analysis of gene expression: from gene discovery to target identification. Drug Discov Today. 2000, 5: 415-425. 10.1016/S1359-6446(00)01544-0.
- Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE: Using the transcriptome to annotate the genome. Nat Biotechnol. 2002, 20: 508-512. 10.1038/nbt0502-508.
- Modrek B, Lee C: A genomic view of alternative splicing. Nat Genet. 2002, 30: 13-19. 10.1038/ng0102-13.
- Pike KA, Baig E, Ratcliffe MJ: The avian B-cell receptor complex: distinct roles of Igalpha and Igbeta in B-cell development. Immunol Rev. 2004, 197: 10-25.
- Hayward WS, Neel BG, Astrin SM: Activation of a cellular onc gene by promoter insertion in ALV-induced lymphoid leukosis. Nature. 1981, 290: 475-480. 10.1038/290475a0.
- Neiman PE, Clurman BE, Lobanenkov VV: Molecular pathogenesis of myc-initiated B-cell lymphomas in the bursa of Fabricius. Curr Top Microbiol Immunol. 1997, 224: 231-238.
- Neiman PE, Ruddell A, Jasoni C, Loring G, Thomas SJ, Brandvold KA, Lee RM, Burnside J, Delrow J: Analysis of gene expression during myc oncogene-induced lymphomagenesis in the bursa of Fabricius. Proc Natl Acad Sci USA. 2001, 98: 6378-6383. 10.1073/pnas.111144898.
- International Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004, 432: 695-716. 10.1038/nature03154.
- Caldwell R, Kierzek A, Arakawa H, Bezzubov Y, Zaim J, Fiedler P, Kutter S, Blagodatski A, Kostavska D, Koter M, Carninci P, Hayashizaki Y, Buerstedde JM: Full-length cDNAs from bursal lymphocytes to facilitate gene function analysis. Genome Biology.
- Buerstedde JM, Arakawa H, Watahiki A, Carninci PP, Hayashizaki YY, Korn B, Plachy J: The DT40 website: Sampling and connecting the genes of a B cell line. Nucl Acid Res. 2002, 30: 230-231. 10.1093/nar/30.1.230.
- Pleasance ED, Marra MA, Jones SJ: Assessment of SAGE in transcript identification. Genome Res. 2003, 13: 1203-1215. 10.1101/gr.873003.
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate – A Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met. 1995, 57 (1): 289-300.
- Arakawa H, Lodygin D, Buerstedde JM: Mutant loxP vectors for selectable marker recycle and conditional knock-outs. BMC Biotechnol. 2001, 1: 7. 10.1186/1472-6750-1-7.
- Wahl M, Shukunami C, Heinzmann U, Hamajima K, Hiraki Y, Imai K: Transcriptome analysis of early chondrogenesis in ATDC5 cells induced by bone morphogenetic protein 4. Genomics. 2004, 83: 45-58. 10.1016/S0888-7543(03)00201-5.
- Abdrakhmanov I, Lodygin D, Geroth P, Arakawa H, Law A, Plachy J, Korn B, Buerstedde JM: A large database of chicken bursal ESTs as a resource for the analysis of vertebrate gene function. Genome Res. 2000, 10: 2062-2069. 10.1101/gr.10.12.2062.
- Buerstedde JM, Prill F: FOUNTAIN: a JAVA open-source package to assist large sequencing projects. BMC Bioinformatics. 2001, 2: 6. 10.1186/1471-2105-2-6.
- Ruijter JM, Van Kampen AH, Baas F: Statistical evaluation of SAGE libraries: consequences for experimental design. Physiol Genomics. 2002, 11: 37-44.