TICdb can be searched through a simple web form that queries the database by gene name. Only gene symbols approved by the HGNC are accepted. For this reason, we include a list of valid HGNC gene identifiers, together with their aliases and a link to Entrez Gene, so that users can check the aliases and HGNC-approved names of their gene(s) of interest [see Additional file 2]. This list can also be reached via a link to a page in which all HGNC names are hyperlinked, so that clicking on any of them performs the database search for that gene and takes the user to the results page. This link is clearly shown on all database pages.
Searching for a valid HGNC symbol returns a table showing: i) a number identifying the fusion event; ii) for each partner gene involved in the translocation, the HGNC symbol, the description of the fragment harboring the breakpoint, and the position of the breakpoint within that fragment; and iii) the source of the junction sequence used to map the breakpoints (a cross-reference to either a GenBank or a PubMed record), hyperlinked to the respective database. Clicking on any HGNC name in the results page performs a database search for that gene. Fragment descriptions take the form "ENSTX:IntronX", that is, the Ensembl stable identifier of the transcript followed by the number of the intron (or exon) of that transcript containing the breakpoint. Fragment names are hyperlinked to Ensembl Exonview, so that the user can easily download the sequence of the fragment and locate the position of the breakpoint. Breakpoints derived from fusion transcript sequences, which are not mapped at nucleotide resolution, are indicated as "breakpoint = 0". An Excel table with all the records in the database, sorted by gene name, is provided [see Additional file 3] and is also easily accessible from the web pages. Searching with the wildcard "%" likewise returns all the entries in the database.
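As an illustration of how the fragment descriptions and breakpoint positions described above might be handled programmatically, the following minimal Python sketch parses a description of the form "ENSTX:IntronX" and treats "breakpoint = 0" as an unmapped position. The exact text pattern, the function and class names, and the transcript identifier used in the example are assumptions made for illustration; they are not part of TICdb itself.

```python
import re
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    transcript: str  # Ensembl stable transcript identifier
    kind: str        # "Intron" or "Exon"
    number: int      # index of the intron/exon within the transcript


# Assumed pattern: "<ENST identifier>:<Intron|Exon><number>".
_FRAGMENT_RE = re.compile(r"^(ENST\d+):(Intron|Exon)(\d+)$")


def parse_fragment(description: str) -> Fragment:
    """Split a fragment description into transcript, fragment type and number."""
    match = _FRAGMENT_RE.match(description.strip())
    if match is None:
        raise ValueError(f"unrecognized fragment description: {description!r}")
    transcript, kind, number = match.groups()
    return Fragment(transcript, kind, int(number))


def breakpoint_position(raw: int) -> Optional[int]:
    """Return None for 'breakpoint = 0' (not mapped at nucleotide resolution)."""
    return None if raw == 0 else raw


# Hypothetical identifier, used only to show the expected shape of the input.
fragment = parse_fragment("ENST00000000001:Intron3")
print(fragment.transcript, fragment.kind, fragment.number)
print(breakpoint_position(0))
```

Returning None rather than 0 for unmapped breakpoints keeps them from being mistaken for a real coordinate in downstream analyses.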
In the results page, genes are listed as 5' or 3' partner genes, depending on which part of the chimeric transcript each gene contributes. In this respect, it should be borne in mind that reciprocal fusion events frequently generate two chimeric transcripts, each corresponding to one of the translocated chromosomes. However, the oncogenic effect of the translocation is usually attributed to only one of the fusion transcripts. In these cases, we therefore consulted all PubMed and GenBank sources and placed the partner genes in the position (5' or 3') in which they appear in the fusion transcript most likely to be responsible for the disease. It should be noted that the same gene can appear as a 5' partner in one translocation and as a 3' partner in another.
The 414 unique fragments correspond to 378 introns and 36 exons, confirming that the vast majority of breakpoints (91.3%) are located within introns and that translocations very rarely disrupt exonic sequences. This is further supported by the fact that 15 of the exonic fragments that contain a translocation breakpoint are either the first or the last exon of the respective gene, with the breakpoint either located in the untranslated regions or keeping most of the coding sequence intact.
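The proportions quoted above follow directly from the fragment counts. The short Python check below reproduces them from the three numbers given in the text (378 introns, 36 exons, and 15 first or last exons); the variable names are ours.

```python
# Counts taken from the text above.
introns = 378
exons = 36
terminal_exons = 15  # exonic fragments that are the first or last exon

total = introns + exons
intron_share = 100 * introns / total
terminal_share = 100 * terminal_exons / exons

print(total)                    # 414
print(round(intron_share, 1))   # 91.3
print(round(terminal_share, 1)) # 41.7
```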
As mentioned before, all BLAST searches were manually curated. Visual inspection of BLAST outputs is necessary to resolve overlaps due to microhomologies, small deletions and insertions, and to choose the Ensembl transcript with the best-supported annotation. This ensures that the data contained in TICdb are of high quality, at the cost of a rather time-consuming construction process. For this reason, general upgrades of the database will be performed only when a new NCBI build is released; regular updates that add new information are much easier to perform and are planned every 6 months.
The information contained in this database can be used to gain biological insights into the mechanisms leading to chromosome translocations in cancer. For instance, we have constructed a network of all the rearranged genes [see Additional file 1]. The content and topology of this network are very similar to those of the network published by Höglund et al. [9], and our network likewise follows a power-law degree distribution. Since the network created by these authors was based on cytogenetic data, it has more nodes than ours. On the other hand, the interactions between nodes in our network are based on molecular data and so substantiate the findings of Höglund et al. at the molecular level.
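To make the notion of a gene network and its degree distribution concrete, the following Python sketch builds an undirected network from a list of partner-gene pairs and tabulates how many genes have each number of distinct partners. It is a minimal illustration and not the procedure used to generate Additional file 1; the gene names are placeholders and the function names are ours.

```python
from collections import Counter, defaultdict

# Placeholder (5' partner, 3' partner) pairs; not records from TICdb.
pairs = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_A", "GENE_D"),
    ("GENE_E", "GENE_B"),
    ("GENE_F", "GENE_G"),
    ("GENE_A", "GENE_B"),  # the same fusion reported by a second record
]


def build_network(gene_pairs):
    """Map each gene to the set of distinct genes it is fused with."""
    neighbors = defaultdict(set)
    for five_prime, three_prime in gene_pairs:
        neighbors[five_prime].add(three_prime)
        neighbors[three_prime].add(five_prime)
    return neighbors


def degree_distribution(neighbors):
    """Count how many genes have each degree (number of distinct partners)."""
    return Counter(len(partners) for partners in neighbors.values())


network = build_network(pairs)
distribution = degree_distribution(network)
for degree in sorted(distribution):
    print(degree, distribution[degree])
```

In a network with a power-law degree distribution, most genes have a single partner and a few genes have many, so a plot of gene count against degree on logarithmic axes is expected to be approximately linear.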
Most importantly, TICdb should be very useful to researchers trying to identify sequence motifs or functional and structural features associated with the appearance of a DNA double-strand break. Double-strand breaks are the initiating lesions that trigger a chromosome translocation, and the probability that a genomic region sustains a double-strand break might depend on its sequence context. In fact, several studies have shown that specific sequence motifs are significantly associated with translocation breakpoints in selected genes in some tumor types [10, 11]. Genome-wide studies, however, have been hindered by the lack of molecular data describing the location of all published translocation breakpoints in all types of malignancies, which is precisely the information provided by TICdb. In this regard, we have previously analyzed a smaller version of this database and identified some structural features common to all translocations in human cancer [12]. Until now, analyses of this kind have been very challenging, since the data required to perform them are scattered across several databases and the literature. We expect that TICdb will greatly facilitate this task and thus become a useful resource in cancer genomics.