Pancreatic Expression database: a generic model for the organization, integration and mining of complex cancer datasets
BMC Genomics volume 8, Article number: 439 (2007)
Pancreatic cancer is the fifth leading cause of cancer death in both males and females. In recent years, a wealth of gene and protein expression studies has been published, broadening our understanding of pancreatic cancer biology. Owing to the explosive growth in publicly available data from many different sources, it is becoming increasingly difficult for individual researchers to integrate these data into their current research programmes. The Pancreatic Expression database, a generic web-based system, aims to close this gap by providing the research community with an open access tool, not only to mine currently available pancreatic cancer data sets but also to include their own data in the database.
Currently, the database holds 32 datasets comprising 7636 gene expression measurements extracted from 20 different published gene or protein expression studies from various pancreatic cancer types, pancreatic precursor lesions (PanINs) and chronic pancreatitis. The pancreatic data are stored in a data management system based on the BioMart technology alongside the human genome gene and protein annotations, sequence, homologue, SNP and antibody data. Interrogation of the database can be achieved through both a web-based query interface and through web services using combined criteria from pancreatic (disease stages, regulation, differential expression, expression, platform technology, publication) and/or public data (antibodies, genomic region, gene-related accessions, ontology, expression patterns, multi-species comparisons, protein data, SNPs). Thus, our database enables connections between otherwise disparate data sources and allows relatively simple navigation between all data types and annotations.
The database structure and content provide a powerful and high-speed data-mining tool for cancer research. It can be used for target discovery (i.e. of biomarkers from body fluids), identification and analysis of genes associated with the progression of cancer, cross-platform meta-analysis, SNP selection for pancreatic cancer association studies, cancer gene promoter analysis and mining of cancer ontology information. The data model is generic and can be easily extended and applied to other types of cancer. The database is available online with no restrictions for the scientific community at http://www.pancreasexpression.org/.
Pancreatic ductal adenocarcinoma (PDAC) usually presents at an advanced stage, so that surgical cure is rarely achieved and conventional chemotherapy and radiotherapy have little impact, resulting in a very low 5-year survival rate (0.5%–5%). A number of laboratories have therefore focused on studying the evolution of pancreatic cancer from its earliest stages (pancreatic intraepithelial neoplasias or PanINs), putting pancreatic cancer among the best studied tumour tissue types at the molecular level. As a result, a wealth of information regarding mutated and aberrantly expressed genes, miRNAs and proteins is now available, not only significantly boosting our biological understanding of the disease but also helping to identify new (early) diagnostic and therapeutic targets. Unfortunately, the huge and still rising volume and diversity of public pancreatic datasets make it increasingly difficult for researchers to integrate this information into their current research efforts. In this report, we describe a dedicated Pancreatic Expression database that aims to overcome this restriction, and furthermore propose it as a generic model for the organization, integration and presentation of complex cancer research data. The model is designed to address various research problems, ranging from the specimen origin and type, through cancer development stages to expression patterns. By bringing complex profiling data together, the Pancreatic Expression database should enable scientists worldwide to perform a whole range of user-friendly queries, from deciphering the biological mechanisms underlying pancreatic disease to target discovery.
Construction and Content
The aim of the Pancreatic Expression database is to provide a comprehensive mining tool for large-scale genomic, transcriptomic and proteomic data sets. To achieve this, we designed a robust internal structure encompassing specific pre-defined modules (which can be found under the "Filters" section in the database), including the "pancreatic specimen/cell type", "pancreatic differential expression information", "genes differentially expressed in" and "genes expressed in" modules. Our design enables uploading of any available (pancreatic) datasets that comply with the structure of the pre-defined modules. Each module contains a number of subcategories related to the module name; setting filters on these subcategories is how user-defined sub-datasets are stored in and retrieved from the database. The "pancreatic specimen/cell type" module covers categories such as normal (microdissected ductal cells (ND) or bulk normal pancreas (NP), acinar cells, islet cells, stromal cells and pancreatic stellate cells), and disease specimens from both exocrine (pancreatic intraepithelial neoplasias (PanIN-1A, PanIN-1B, PanIN-2, PanIN-3), chronic pancreatitis (CP), pancreatic adenocarcinoma (PDAC), intraductal papillary mucinous neoplasms (IPMN), mucinous cystic tumours and ampullary carcinoma) and endocrine (functioning and non-functioning tumours) origin. Moreover, pancreatic juice, plasma, urine, serum and fine needle aspirates are included as additional options to allow for future expansion of the database. The "pancreatic differential expression information" module provides information on direction of regulation (up- and down-regulation), fold-change, SAGE tag number and whether a gene or protein was found to be expressed only in pancreatic adenocarcinoma (PDAC) or in normal pancreas.
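To make the filter idea concrete, the following is a minimal sketch of how setting filters on module subcategories selects a sub-dataset. It is not the database's implementation (which is built on BioMart and MySQL); the `Measurement` record, the `apply_filters` helper and the `GENE_A`-style identifiers are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Measurement:
    """One expression measurement (hypothetical record layout)."""
    gene: str                     # placeholder identifier, not a real gene
    specimen: str                 # specimen/cell type subcategory, e.g. "PDAC", "CP"
    regulation: str               # direction of regulation: "up" or "down"
    fold_change: Optional[float]  # None when a study reports no fold-change


def apply_filters(
    rows: Iterable[Measurement],
    specimen: Optional[str] = None,
    regulation: Optional[str] = None,
    min_fold_change: Optional[float] = None,
) -> List[Measurement]:
    """Keep only rows matching every filter that was set (None = not set)."""
    selected = []
    for row in rows:
        if specimen is not None and row.specimen != specimen:
            continue
        if regulation is not None and row.regulation != regulation:
            continue
        if min_fold_change is not None:
            # Rows without a reported fold-change cannot satisfy this filter.
            if row.fold_change is None or row.fold_change < min_fold_change:
                continue
        selected.append(row)
    return selected


# Invented example rows; the values carry no biological meaning.
EXAMPLE_ROWS = [
    Measurement("GENE_A", "PDAC", "up", 3.5),
    Measurement("GENE_B", "PDAC", "down", 2.1),
    Measurement("GENE_C", "CP", "up", 4.0),
    Measurement("GENE_D", "PDAC", "up", None),
]

up_in_pdac = apply_filters(EXAMPLE_ROWS, specimen="PDAC", regulation="up")
```

Leaving a filter unset simply skips that criterion, which mirrors the optional nature of filters in the query interface.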
The "genes differentially expressed in" module enables more defined selection of comparison methods such as pancreatic adenocarcinoma (PDAC) versus normal pancreas (bulk tissue or microdissected normal ductal cells), chronic pancreatitis (CP) versus normal pancreas (bulk tissue or microdissected normal ductal cells), chronic pancreatitis (CP) versus pancreatic adenocarcinoma (PDAC), pancreatic intraepithelial neoplasias (PanIN-1A, PanIN-1B, PanIN-2 or PanIN-3) versus normal pancreas (bulk tissue or microdissected normal ductal cells), etc. The "genes expressed in" module lists the genes expressed in the tissue types defined in the pancreatic specimen/cell type module, irrespective of their mode of regulation (whether they are differentially expressed or not). The "platform technology" module enables the selection of the technology used, such as Affymetrix arrays, cDNA arrays, Sanger human 10K cDNA arrays version 1.2.1, Sanger custom 5K1 cDNA arrays, Clontech Atlas Human Cancer cDNA Expression Array, SAGE, Agilent Human Genome CGH array, 2D PAGE, SELDI, etc.
The data are stored in a data management system created using MySQL and based on the open-source BioMart technology, a simple, federated query system designed specifically for use with large datasets. We imported the available Ensembl human genome annotations (Ensembl release 41) for genes and proteins, SNP information, sequences, gene structure and multi-species data, enabling the integration and annotation of heterogeneous pancreatic cancer data. To avoid integration and annotation errors, we used the pre-established Ensembl annotations and microarray probe set mapping. Ensembl links to the UniProt/Swiss-Prot, RefSeq and UniProt/TrEMBL databases are made on the basis of sequence similarity. All subsequent links are inferred from these mappings. Ensembl also establishes mappings to microarray probe set identifiers by matching probe set sequences to Ensembl transcripts.
We also integrated the antibody data from the Human Protein Atlas, using the Ensembl gene ID as the common key.
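Integration keyed on the Ensembl gene ID amounts to a join between the pancreatic data and each annotation source. The sketch below shows that idea with plain Python dictionaries. The field names, the `annotate_by_gene_id` helper and all identifiers are hypothetical placeholders, and the real system performs this linkage inside BioMart rather than in application code.

```python
from typing import Dict, List


def annotate_by_gene_id(
    measurements: List[dict],
    antibodies_by_gene: Dict[str, List[str]],
) -> List[dict]:
    """Left join: attach antibody identifiers to each measurement.

    Measurements whose gene has no antibody entry are kept and receive
    an empty list, so no pancreatic data is lost by the join.
    """
    annotated = []
    for measurement in measurements:
        merged = dict(measurement)  # copy, so the input is not modified
        gene_id = measurement["ensembl_gene_id"]
        merged["antibodies"] = sorted(antibodies_by_gene.get(gene_id, []))
        annotated.append(merged)
    return annotated


# Placeholder identifiers only; these are not real Ensembl or antibody IDs.
example_measurements = [
    {"ensembl_gene_id": "ENSG_EXAMPLE_1", "regulation": "up"},
    {"ensembl_gene_id": "ENSG_EXAMPLE_2", "regulation": "down"},
]
example_antibodies = {"ENSG_EXAMPLE_1": ["AB_EXAMPLE_2", "AB_EXAMPLE_1"]}

annotated_rows = annotate_by_gene_id(example_measurements, example_antibodies)
```

Using a stable gene identifier as the only join key is what lets new annotation sources be attached without touching the expression records themselves.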
The Pancreatic Expression database currently contains 32 datasets from 20 different published sources, from 14 international laboratories encompassing 22 different platforms (Affymetrix GeneChip Human Full Length Array HuGeneFL, Affymetrix GeneChip Human Genome U95 Set (HG-U95A, HG-U95B, HG-U95C, HG-U95D, HG-U95E), Affymetrix GeneChip Human Genome U133 Array Set (HG-U133A, HG-U133B), 2D PAGE, cDNA arrays, SAGE, Operon oligo array version 2.0, Clontech Atlas Human Cancer cDNA Expression Array, immunohistochemistry, in situ hybridisation, Oligo array, MALDI, mass spectrometry, Sanger human 10K cDNA arrays version 1.2.1, Sanger custom 5K1 cDNA arrays, United Gene Technique Ltd, BD PowerBlot Western array and qRT-PCR) [8–27]. These initial datasets provide valuable information about 7636 gene expression measurements from a first-pass selection of relevant papers in the field of pancreatic research; the inclusion of additional relevant datasets will be a continuous and ongoing process. All the datasets were manually processed, checked for accuracy and consistency, and loaded into our relational database alongside annotations from several public resources such as Ensembl, GO, dbSNP, UniProt and the Human Protein Atlas. Several modules are currently present for which data are either not yet incorporated or not available (ICAT, iTRAQ). These will be populated as data become available, as we intend to extend the data content continuously and to cover all the existing modules.
Utility and Discussion
The Pancreatic Expression database provides access not only to bioinformatic and biostatistic experts but also to bench researchers with a limited knowledge of bioinformatics. The database allows multiple levels of access. Firstly, access to the data is provided through a customized version of MartView, a BioMart web-based query interface based on the Perl API. The interface is navigated using the left panel, with user selections taking place in the right panel. A summary of user choices is also displayed in the left panel. A simple query involves choosing attributes (or using the default ones) and, optionally, filters to restrict the query (Figure 1). Secondly, the Pancreatic Expression database is available from the BioMart central server (Figure 2A), where it is exposed to third party software such as the Bioconductor package biomaRt, allowing easy interrogation within the open source R statistical environment and integration into any expression profiling experiment (Figure 2B). In addition, the database is exposed to the Taverna workflow system (Figure 2C) and to the Galaxy framework (Figure 2D). The data can also be accessed programmatically through web services (Figure 3). A query constructed in the web-based query interface can be easily converted into an XML or Perl template for future bioinformatics expansion and use. Finally, the Pancreatic Expression database is a DAS server providing a Pancreatic Expression DAS annotation available at the Ensembl GeneView (Figure 4).
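As an illustration of programmatic access, the sketch below assembles an XML query of the general kind BioMart web services accept and encodes it into a request URL. It assumes the usual BioMart query layout (a `Query` element containing a `Dataset` with `Filter` and `Attribute` children). The dataset, filter and attribute names shown are hypothetical placeholders, since the real names must be taken from the database's own configuration, for example by exporting a query from the web interface. The code only builds the request and does not contact the server.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

# Web services address as given in the paper's reference list.
MARTSERVICE_URL = "http://www.pancreasexpression.org/biomart/martservice"


def build_query(dataset: str, attributes: List[str], filters: Dict[str, str]) -> str:
    """Return a BioMart-style XML query string (assumed standard layout)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated result rows
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def build_request_url(xml_query: str, base_url: str = MARTSERVICE_URL) -> str:
    """URL-encode the XML into the 'query' parameter of the service URL."""
    return base_url + "?" + urlencode({"query": xml_query})


# Hypothetical names: replace with names exported from the real interface.
example_xml = build_query(
    dataset="example_pancreas_dataset",
    attributes=["example_gene_id_attribute"],
    filters={"example_specimen_filter": "PDAC"},
)
example_url = build_request_url(example_xml)
```

Building the XML with an XML library rather than string concatenation ensures that filter values containing special characters are escaped correctly.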
Examples of use
Navigation between all data types is simple and user-friendly; a variety of possible query combinations allows researchers to determine quickly the most de-regulated genes and proteins across all platforms.
Using the Pancreatic Expression database, it is possible to search and retrieve genes/proteins expressed only in pancreatic cancer and not in chronic pancreatitis and then ask which of these are present in urine and/or plasma. Such a query would be a first step for the discovery of non-invasive pancreatic cancer biomarkers from body fluids (Figure 5).
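The logic of this biomarker query reduces to set operations: a difference to exclude chronic pancreatitis, then an intersection with the body-fluid lists. A minimal sketch follows, in which the `candidate_fluid_biomarkers` helper and the `GENE_*` identifiers are hypothetical and the gene lists are assumed to have been retrieved from the database beforehand.

```python
from typing import Iterable, Set


def candidate_fluid_biomarkers(
    cancer: Iterable[str],
    pancreatitis: Iterable[str],
    urine: Iterable[str],
    plasma: Iterable[str],
) -> Set[str]:
    """Genes/proteins seen in cancer but not pancreatitis, and found in urine and/or plasma."""
    cancer_only = set(cancer) - set(pancreatitis)
    in_body_fluids = set(urine) | set(plasma)
    return cancer_only & in_body_fluids


# Placeholder identifiers with no biological meaning.
candidates = candidate_fluid_biomarkers(
    cancer=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    pancreatitis=["GENE_B"],
    urine=["GENE_A", "GENE_B"],
    plasma=["GENE_C", "GENE_E"],
)
```

As the text notes, such a list is only a first step; every candidate would still need experimental confirmation.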
Researchers interested in the genes involved in the progression of pancreatic cancer can select the corresponding information among the differential expression datasets for the various tumour stages and retrieve the genes found to be de-regulated in the progression of pancreatic cancer (Figure 6). In the same way, one can search for genes specific to certain types of pancreatic cancer (Figure 1B).
Our database also allows cross-platform meta-analysis. Scientists can investigate pancreatic expression profiling performed across a wide range of different platforms (such as cDNA arrays or oligo arrays) to detect the most consistent sets of de-regulated genes (Figure 7). Importantly, scientists can also retrieve the sets of overlapping genes between their own results obtained by their particular platform (Proteomics, Affymetrix, Illumina etc.) and annotation method (UniProt, RefSeq, HGNC Hugo etc.) and those reported in the studies stored in the Pancreatic Expression database (Figure 8).
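One simple way to picture the cross-platform comparison is to count, for each gene, the number of platforms on which it was reported as de-regulated, and then to intersect the consistent set with a user's own result list. The sketch below does this under the assumption that all lists have already been mapped to a common identifier scheme. The function names, the `min_platforms` threshold and the `GENE_*` identifiers are illustrative choices, not part of the database.

```python
from collections import Counter
from typing import Dict, Iterable, List


def consistent_genes(genes_by_platform: Dict[str, Iterable[str]], min_platforms: int = 2) -> List[str]:
    """Genes reported as de-regulated on at least `min_platforms` platforms."""
    counts: Counter = Counter()
    for genes in genes_by_platform.values():
        counts.update(set(genes))  # set(): count each gene once per platform
    return sorted(gene for gene, n in counts.items() if n >= min_platforms)


def overlap_with_own_results(own_genes: Iterable[str], database_genes: Iterable[str]) -> List[str]:
    """Genes present both in a user's results and in the database-derived list."""
    return sorted(set(own_genes) & set(database_genes))


# Placeholder data with no biological meaning.
example_platforms = {
    "cDNA array": ["GENE_A", "GENE_B", "GENE_B"],
    "oligo array": ["GENE_A", "GENE_C"],
    "SAGE": ["GENE_A", "GENE_C", "GENE_D"],
}
consistent = consistent_genes(example_platforms, min_platforms=2)
shared = overlap_with_own_results(["GENE_C", "GENE_Z"], consistent)
```

Raising `min_platforms` trades sensitivity for stricter agreement between technologies.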
Because the database is accessible through the biomaRt package, its annotations can be added to any disease expression profiling experiment, which allows detection of genes de-regulated in both pancreatic cancer and any other disease (Figure 2B). One can also obtain the gene ontology classification of the retrieved datasets or mine the genes of interest for a specific ontology term (Figure 9). Association studies can be designed using the Pancreatic Expression database by selecting a specific category of functional consequences (coding non-synonymous, 3' UTR, 5' UTR, splice site etc.) for SNPs associated with genes involved in pancreatic cancer (Figure 10). Expression data from a specific anatomical site can also be retrieved (Figure 11). Researchers can obtain immunohistochemistry data, where available, by selecting the antibodies filter (Figure 12). Scientists interested in promoter analysis can combine the gene search with the human genome upstream sequences and thereby collect the promoter sequences in a fast and simple way, allowing further analysis of transcription factor-binding sites (Figure 1A).
Our integration model brings together relevant pancreatic cancer datasets and annotations from public sources and enables scientists to perform a wide variety of complex queries on various types of data. The design of the database allows easy integration of additional modules and annotations from new public databases.
The Pancreatic Expression database constitutes a unique and valuable resource for the wider cancer research community, and is in rapid and constant development. We aim to continuously import new data sources and update the database on a regular basis, and invite scientists worldwide to deposit and share their data.
Although initially constructed using pancreatic cancer expression datasets, we have designed and implemented a generic system that can be easily modified and applied to any other type of cancer. The system is available for collaboration with all interested research groups either by extending it to include other cancer data or by sharing our model should they want to adopt it for their data.
Availability and requirements
Project name: Pancreatic Expression database
Project home page: http://www.pancreasexpression.org
Operating system(s): Platform independent; Standard WWW browser (Safari, Firefox)
Programming language: Perl, SQL, BioMart data management system
Licence: The database is freely available to academic and non-academic users. However, should you find the Pancreatic Expression database useful to your work, please cite this paper.
Schneider G, Siveke JT, Eckel F, Schmid RM: Pancreatic cancer: basic and clinical aspects. Gastroenterology. 2005, 128 (6): 1606-1625. 10.1053/j.gastro.2005.04.001.
Pancreatic Expression database. [http://www.pancreasexpression.org]
Ensembl microarray probeset mapping. [http://www.ensembl.org/info/about/docs/microarray_probe_set_mapping.html]
Human Protein Atlas. [http://www.proteinatlas.org]
Van Heek NT, Maitra A, Koopmann J, Fedarko N, Jain A, Rahman A, Iacobuzio-Donahue CA, Adsay V, Ashfaq R, Yeo CJ, Cameron JL, Offerhaus JA, Hruban RH, Berg KD, Goggins M: Gene expression profiling identifies markers of ampullary adenocarcinoma. Cancer Biol Ther. 2004, 3 (7): 651-656.
Adachi J, Kumar C, Zhang Y, Olsen JV, Mann M: The human urinary proteome contains more than 1500 proteins, including a large proportion of membrane proteins. Genome Biol. 2006, 7 (9): R80. 10.1186/gb-2006-7-9-r80.
Anderson NL, Polanski M, Pieper R, Gatlin T, Tirumalai RS, Conrads TP, Veenstra TD, Adkins JN, Pounds JG, Fagan R, Lobley A: The human plasma proteome: a nonredundant list developed by combination of four separate sources. Mol Cell Proteomics. 2004, 3 (4): 311-326. 10.1074/mcp.M300127-MCP200.
Buchholz M, Braun M, Heidenblut A, Kestler HA, Kloppel G, Schmiegel W, Hahn SA, Luttges J, Gress TM: Transcriptome analysis of microdissected pancreatic intraepithelial neoplastic lesions. Oncogene. 2005, 24 (44): 6626-6636. 10.1038/sj.onc.1208804.
Crnogorac-Jurcevic T, Efthimiou E, Capelli P, Blaveri E, Baron A, Terris B, Jones M, Tyson K, Bassi C, Scarpa A, Lemoine NR: Gene expression profiles of pancreatic cancer and stromal desmoplasia. Oncogene. 2001, 20 (50): 7437-7446. 10.1038/sj.onc.1204935.
Crnogorac-Jurcevic T, Efthimiou E, Nielsen T, Loader J, Terris B, Stamp G, Baron A, Scarpa A, Lemoine NR: Expression profiling of microdissected pancreatic adenocarcinomas. Oncogene. 2002, 21 (29): 4587-4594. 10.1038/sj.onc.1205570.
Crnogorac-Jurcevic T, Gangeswaran R, Bhakta V, Capurso G, Lattimore S, Akada M, Sunamura M, Prime W, Campbell F, Brentnall TA, Costello E, Neoptolemos J, Lemoine NR: Proteomic analysis of chronic pancreatitis and pancreatic adenocarcinoma. Gastroenterology. 2005, 129 (5): 1454-1463. 10.1053/j.gastro.2005.08.012.
Crnogorac-Jurcevic T, Missiaglia E, Blaveri E, Gangeswaran R, Jones M, Terris B, Costello E, Neoptolemos JP, Lemoine NR: Molecular alterations in pancreatic carcinoma: expression profiling shows that dysregulated expression of S100 genes is highly prevalent. J Pathol. 2003, 201 (1): 63-74. 10.1002/path.1418.
Friess H, Ding J, Kleeff J, Fenkell L, Rosinski JA, Guweidhi A, Reidhaar-Olson JF, Korc M, Hammer J, Buchler MW: Microarray-based identification of differentially expressed growth- and metastasis-associated genes in pancreatic cancer. Cell Mol Life Sci. 2003, 60 (6): 1180-1199.
Grutzmann R, Pilarsky C, Ammerpohl O, Luttges J, Bohme A, Sipos B, Foerder M, Alldinger I, Jahnke B, Schackert HK, Kalthoff H, Kremer B, Kloppel G, Saeger HD: Gene expression profiling of microdissected pancreatic ductal carcinomas using high-density DNA microarrays. Neoplasia. 2004, 6 (5): 611-622. 10.1593/neo.04295.
Hu L, Evers S, Lu ZH, Shen Y, Chen J: Two-dimensional protein database of human pancreas. Electrophoresis. 2004, 25 (3): 512-518. 10.1002/elps.200305683.
Iacobuzio-Donahue CA, Maitra A, Shen-Ong GL, van Heek T, Ashfaq R, Meyer R, Walter K, Berg K, Hollingsworth MA, Cameron JL, Yeo CJ, Kern SE, Goggins M, Hruban RH: Discovery of novel tumor markers of pancreatic cancer using global gene expression technology. Am J Pathol. 2002, 160 (4): 1239-1249.
Logsdon CD, Simeone DM, Binkley C, Arumugam T, Greenson JK, Giordano TJ, Misek DE, Kuick R, Hanash S: Molecular profiling of pancreatic adenocarcinoma and chronic pancreatitis identifies multiple genes differentially regulated in pancreatic cancer. Cancer Res. 2003, 63 (10): 2649-2657.
Lu Z, Hu L, Evers S, Chen J, Shen Y: Differential expression profiling of human pancreatic adenocarcinoma and healthy pancreatic tissue. Proteomics. 2004, 4 (12): 3975-3988. 10.1002/pmic.200300863.
Maitra A, Hansel DE, Argani P, Ashfaq R, Rahman A, Naji A, Deng S, Geradts J, Hawthorne L, House MG, Yeo CJ: Global expression analysis of well-differentiated pancreatic endocrine neoplasms using oligonucleotide microarrays. Clin Cancer Res. 2003, 9 (16 Pt 1): 5988-5995.
Nakamura T, Furukawa Y, Nakagawa H, Tsunoda T, Ohigashi H, Murata K, Ishikawa O, Ohgaki K, Kashimura N, Miyamoto M, Hirano S, Kondo S, Katoh H, Nakamura Y, Katagiri T: Genome-wide cDNA microarray analysis of gene expression profiles in pancreatic cancers using populations of tumor cells and normal ductal epithelial cells selected for purity by laser microdissection. Oncogene. 2004, 23 (13): 2385-2400. 10.1038/sj.onc.1207392.
Segara D, Biankin AV, Kench JG, Langusch CC, Dawson AC, Skalicky DA, Gotley DC, Coleman MJ, Sutherland RL, Henshall SM: Expression of HOXB2, a retinoic acid signaling target in pancreatic cancer and pancreatic intraepithelial neoplasia. Clin Cancer Res. 2005, 11 (9): 3587-3596. 10.1158/1078-0432.CCR-04-1813.
Shen J, Person MD, Zhu J, Abbruzzese JL, Li D: Protein expression profiles in pancreatic adenocarcinoma compared with normal pancreatic tissue and tissue affected by pancreatitis as detected by two-dimensional gel electrophoresis and mass spectrometry. Cancer Res. 2004, 64 (24): 9018-9026. 10.1158/0008-5472.CAN-04-3262.
Tan ZJ, Hu XG, Cao GS, Tang Y: Analysis of gene expression profile of pancreatic carcinoma using cDNA microarray. World J Gastroenterol. 2003, 9 (4): 818-823.
Terris B, Blaveri E, Crnogorac-Jurcevic T, Jones M, Missiaglia E, Ruszniewski P, Sauvanet A, Lemoine NR: Characterization of gene expression profiles in intraductal papillary-mucinous tumors of the pancreas. Am J Pathol. 2002, 160 (5): 1745-1754.
Pancreatic Expression database web-based query interface. [http://www.pancreasexpression.org/biomart/martview]
Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005, 21 (16): 3439-3440. 10.1093/bioinformatics/bti525.
R project. [http://www.r-project.org]
Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006, 34 (Web Server issue): W729-32. 10.1093/nar/gkl320.
Pancreatic Expression database access through web services. [http://www.pancreasexpression.org/biomart/martservice]
The funding for this project was obtained through FW6 EU project MolDiag-Paca and Cancer Research UK programme grant C355/A6253.
Disclaimer: Information in the Pancreatic Expression database is curated from highly relevant published pancreatic cancer papers. However, the quality and accuracy of the published data are solely the responsibility of the authors. The Pancreatic Expression database is a tool for mining the literature rather than a substitute for experiments. We strongly recommend that researchers trace the origin of the data to check whether they comply with their quality standards. We also recommend that researchers apply independent technologies to confirm data retrieved through our mining tool before integrating them into their own research efforts.
CC designed and implemented the web site and database, annotated and integrated the data, contributed to the data collection and wrote the manuscript. SAH defined the pancreatic cancer modules, collected pancreatic data, tested the database, contributed to the revision of the manuscript and continuous discussion. HJW, SB, DH and TPR were involved in the pancreatic data collection. TCJ contributed to the definition of the pancreatic cancer modules and pancreatic data collection. TCJ and NRL provided valuable guidance and expertise on pancreatic cancer, contributed to the critical revision of the manuscript and continuous discussion. All authors read the final manuscript.
Nicholas R Lemoine and Tatjana Crnogorac-Jurcevic contributed equally to this work.