Analysis of the ZR-75-30 genome
Together, these data provide a gene-level analysis of most of the unamplified genome rearrangements in this cell line that span more than 10 kb. A few details are still missing, notably the centromeric breakpoints and some balanced breakpoints. Balanced breakpoints are invisible to array-CGH, and not all were sampled by the paired-end sequencing or fine-mapped in our previous array painting.
Paired-end sequencing has various limitations, and combining it with other structural data, as we have done, is clearly valuable. Firstly, the method is not expected to find all rearrangements, because it samples the genome at random and coverage depends on GC content. In addition, reads in repeats and segmental duplications generally cannot be used, because they cannot be mapped to a unique position in the reference genome. Secondly, artefactual rearrangements can be created by coligation of DNA fragments during preparation for sequencing, and by errors in mapping reads.
Sampling of junctions was surprisingly good: we accounted for 97% of the copy number steps detected by array-CGH in the amplicon, where the greater number of reads across the junctions increased sensitivity. This suggests that, even using only 36 bp reads, rather few junctions would be undetectable because they are flanked by non-unique sequences. The lower sampling of single-copy junctions resulted in about 55% of the junctions detected by array-CGH being detected by sequencing. Conversely, we identified almost twice as many junctions in the amplicon as we expected from the copy number steps. These were presumably a mixture of artefacts and additional rearrangements that are not resolved by CGH, either because they involve small fragments or are balanced.
Another limitation of paired-end sequencing is that it does not show how junctions are joined together, e.g. whether two apparently neighbouring junctions are on the same chromosome or not, or whether the region between them is interrupted by further junctions. This is illustrated by two of the fusion genes, APPBP2-PHF20L1 and PLEC1-ENPP2, both of which are transcribed across more than one genomic junction.
ZR-75-30 expresses at least 12 fusion transcripts
By combining molecular cytogenetic approaches (high-resolution array-CGH and array painting) with paired-end sequencing, we have catalogued the genome rearrangements of this cell line and found nine expressed fusion transcripts. To these we added three further fusion transcripts found by sequencing cDNA, for which we have identified the genomic junctions.
Nine of the 12 fusions in ZR-75-30 are in the complex coamplification of chromosomes 8 and 17, with the fusions APPBP2-PHF20L1, BCAS3-HOXB9, TAOK1-PCGF2 and DDX5-DEPDC6/DEPTOR being the most amplified. Such complex coamplifications are common and probably give the ‘firestorm’ pattern of multiple small amplified fragments seen in array-CGH [22, 39]. The MCF7 cell line has a similar coamplification involving chromosomes 1, 3, 17 and 20 and containing highly amplified gene fusions.
Of these 12 fusion genes, seven were formed by intra-chromosomal rearrangements, confirming that more fusion genes are formed by intra-chromosomal rearrangement than by chromosome translocation. This might be expected if rearrangements arise at replication bubbles rather than by random breakage and rejoining.
How many expressed fusion genes are there in breast cancers?
Extrapolating from our work and that of Robinson DR et al., ZR-75-30 may have around 18 expressed fusion genes, and breast cancers in general (as opposed to cell lines) may express around 10 on average.
In ZR-75-30, using structural analysis, we found half of the six expressed fusions detected by Robinson DR et al., while, using cDNA sequencing, they found three of the nine we detected. Both figures suggest that the true total might be around 18. This is consistent with recent, probably incomplete, figures from other cell lines: 20 expressed fusions have been verified in MCF7, with several more predicted computationally [6, 13, 15, 40]; 43 have been found in BT474 and 13 in SKBR3.
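The estimate of about 18 can be reproduced with a simple two-sample overlap calculation. The sketch below assumes that the two studies sampled the same pool of fusions independently, which is the reasoning behind the Lincoln-Petersen (capture-recapture) estimator; the text does not name this estimator, and the function and variable names are illustrative only.

```python
def lincoln_petersen(n_first: int, n_second: int, n_shared: int) -> float:
    """Estimate a total population size from two overlapping samples.

    Assumes the two samples are independent draws from the same population
    (an assumption made for this sketch, not a claim from the paper).
    """
    if n_shared <= 0:
        raise ValueError("the samples must share at least one item")
    return n_first * n_second / n_shared


# Counts taken from the text for ZR-75-30.
structural_fusions = 9  # expressed fusions found by structural analysis
cdna_fusions = 6        # expressed fusions found by cDNA sequencing
shared_fusions = 3      # fusions found by both approaches

estimated_total = lincoln_petersen(structural_fusions, cdna_fusions, shared_fusions)
print(f"Estimated expressed fusions in ZR-75-30: {estimated_total:.0f}")
```

With only three shared fusions the estimate is necessarily rough, so it is best read as an order-of-magnitude guide rather than a precise count.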
Breast cancers, as opposed to cell lines, appear to have almost as many fusions. Robinson DR et al. identified an average of 4.2 expressed fusions per case (0 to 20 in 38 breast tumours), compared to 5.5 per case in cell lines. Their sensitivity seems to have been around 40%, judging by a comparison of their findings with ours and with the published cell line data above. This gives a best guess that breast tumours will on average express 10 fusions, with wide variation from case to case, as expected from their variable levels of rearrangement.
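The step from 4.2 observed fusions to about 10 expected fusions is a correction for detection sensitivity. A minimal sketch of that arithmetic follows; it assumes a single uniform sensitivity of 0.4 applies to every case, which is a simplification, and the function name is hypothetical.

```python
def correct_for_sensitivity(observed_mean: float, sensitivity: float) -> float:
    """Scale an observed mean count up by the fraction of events detected."""
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in the interval (0, 1]")
    return observed_mean / sensitivity


# Assumption for this sketch: roughly 40% sensitivity, as estimated in the text.
ASSUMED_SENSITIVITY = 0.4

tumour_estimate = correct_for_sensitivity(4.2, ASSUMED_SENSITIVITY)
cell_line_estimate = correct_for_sensitivity(5.5, ASSUMED_SENSITIVITY)

print(f"Breast tumours: about {tumour_estimate:.1f} expressed fusions per case")
print(f"Cell lines: about {cell_line_estimate:.1f} expressed fusions per case")
```

For tumours this gives roughly 10.5, in line with the rounded figure of 10 quoted above. The result is sensitive to the assumed detection rate: the ZR-75-30 overlap alone (three of nine) would imply about 33% rather than 40%.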
Are these passenger or driver mutations?
The fusions found here argue strongly that at least some are selected, i.e. ‘driver’ mutations, rather than random incidental ‘passenger’ mutations. As detailed in the supplementary discussion in Additional file 9, several of the genes involved have already been found to be fused in other breast cancer cell lines (PHF20L1 and BCAS3 [6, 13, 15, 21, 44]) or in other tumours (BCAS3 again, and PCGF2, TAOK1 and TRPS1 [45, 46]). Others are members of families that include multiple fused genes: the collagens, HOX and PHF families. Several of the fusions resemble known recurrent gene fusions in general functional terms [1, 2]: for example, fusions of HOXB9, PCGF2, PHF20L1 and NRIP1 would be typical of the many known fusions that control gene expression directly or via chromatin structure, and all could encode functional domains of the proteins. Several of the genes involved are also in signalling pathways relevant to breast cancer: ERBB2, NRIP1 and BCAS3 are involved in estrogen receptor function, and APPBP2 in that of the androgen receptor, while TAOK1 and SKAP1 are involved in MAPK signalling and DEPDC6/DEPTOR regulates mTOR signalling.
Several of the fused genes are also recurrently broken in a substantial proportion of breast cancers, as judged by copy number steps in array-CGH of 1000 breast tumours: around 10% have breaks in ERBB2, BCAS3 and SKAP1, while COL14A1, TIAM1, USP32 and TAOK1 are broken in around 4%.
Some of the fusions, and particularly those not expressed, may simply inactivate a copy of the participating gene(s) [1, 6]. For example, our fusions of TIAM1 and TAOK1 inactivate one copy of these genes. Some genes, e.g. BCAS3, that are fused in more than one cancer cell line retain different, non-overlapping parts of the gene in different cases, suggesting the common theme is inactivation. In some cases fusion of a gene may suppress its expression, perhaps by destabilising the mRNA: among the predicted fusion genes for which we could not detect a transcript, unfused copies of some of the 5’ participating genes were transcribed—for example SSH2, NUDCD1 and TRAPPC9 (Table 1; Additional file 7).