PDbase: a database of Parkinson's Disease-related genes and genetic variation using substantia nigra ESTs
© Yang et al. 2009
Published: 3 December 2009
Parkinson's disease (PD) is one of the most common neurodegenerative disorders and is clinically characterized by impaired motor function. Because the etiology of PD is diverse and complex, many researchers have created PD-related research resources. However, resources for brain and PD studies are still lacking. Therefore, we constructed a database of PD-related genes and genetic variations using PD and normal substantia nigra (SN) tissues. In addition, we integrated PD-related information from several resources.
We collected 6,130 SN expressed sequence tags (ESTs) from normal brain SN tissues and SN tissues of PD patients, using the full-length cDNA library and normalized cDNA library construction methods from our previous study. The SN ESTs were clustered into 2,951 UniGene clusters and assigned to 2,678 genes. By comparing EST frequencies between normal and PD SN, we then found 57 up-regulated genes and 48 down-regulated genes at a cut-off probability of differential expression above 0.9, based on the Audic and Claverie method. In addition, we integrated disease-related information from public resources. To examine the characteristics of these PD-related genes, we analyzed alternative splicing events, single nucleotide polymorphism (SNP) markers located in the gene regions, repeat elements, gene regulation elements, and pathways and protein-protein interaction networks.
We constructed the PDbase database to capture PD-related genes, genetic variation, and functional elements. The database contains 2,698 PD-related genes, identified through ESTs from human normal and PD SN tissues and through the integration of several public resources. PDbase provides mitochondrial proteins, microRNA gene regulation elements, single nucleotide polymorphism (SNP) markers within PD-related gene structures, repeat elements, and pathways and networks with protein-protein interaction information. The information in PDbase can aid in understanding the causes of PD. It is available at http://bioportal.kobic.re.kr/PDbase/. Supplementary data are available at http://bioportal.kobic.re.kr/PDbase/suppl.jsp
The prevalence of age-related neurodegenerative diseases is growing continuously due to the steady increase in human life span. It affects almost half of all patients with dementia. Parkinson's disease (PD) is the second most common age-related neurodegenerative disease and results in abnormalities in motor function. Because of the high frequency of PD, many researchers have tried to find its causes. The disease is clinically characterized by impaired motor function, manifested as resting tremor, rigidity, bradykinesia, and postural instability. PD is caused by the degeneration of dopaminergic neurons in the substantia nigra (SN) pars compacta.
Although the causes of PD are a diverse and complex combination of mitochondrial protein dysfunction, the effects of genetic variation in cell cycles, and environmental risk factors, it is now clear that genetic factors contribute to the pathogenesis of the disease [5–9]. However, the etiology of sporadic PD, which accounts for 95% of cases, is still not fully understood. To address this problem, several resources, such as MDPD and PDGene, have been developed to support PD studies. MDPD provides unique functionality for comparing differences in the types of mutations among ethnic groups, manually examined by biomedical researchers. PDGene, in the Gene Prospector application, provides evidence about human genes in relation to Parkinson's disease and risk factors from association studies. Although these resources integrate useful PD-related information, they focus on genetic mutations and PD-association studies, and information that depends solely on public resources remains limited. Therefore, we constructed experimental resources to investigate a wide spectrum of molecular events before integrating the PD-related public resources. Because public databases for SNPs and diseases are large, complicated, and difficult to use, we developed a pipeline system to provide disease-related genes and genetic variations.
We collected 6,130 substantia nigra (SN) expressed sequence tags (ESTs) from full-length cDNA libraries of normal brain SN tissues and SN tissues of PD patients, constructed using oligo-capping methods in a previous study. These SN ESTs were deposited in the NCBI dbEST database under accession numbers DT214917 to DT221046. The full-length cDNA library was constructed using an improved capping method with the pCNS-D2 vector. A normalized cDNA library was also constructed, following a previously described method, to obtain rarely expressed genes. We checked for repeat elements using the RepeatMasker program (http://repeatmasker.org). To obtain high-quality SN ESTs, we applied several filtering steps: 1) removing short ESTs, 2) removing ESTs contaminated by genomic DNA or E. coli sequences, and 3) removing ESTs that did not align to any UniGene cluster. We analyzed PD candidate genes using the resulting EST pool, which contained 2,850 SN ESTs from PD patients and 2,883 SN ESTs from normal tissues. We annotated the SN ESTs based on UniGene clusters and obtained information on 2,679 genes with 5,733 UniGene clusters.
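The three filtering steps listed above can be sketched as a simple sequential filter. This is a minimal illustration only: the `Est` record, its field names, and the 100-base minimum length are hypothetical, since the text does not state the length threshold or how contamination was flagged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Est:
    """Hypothetical EST record; field names are illustrative, not from PDbase."""
    accession: str
    sequence: str
    contaminated: bool              # assumed flag: matched genomic DNA or E. coli
    unigene_cluster: Optional[str]  # None if the EST aligned to no UniGene cluster


# Assumed threshold; the minimum EST length is not given in the text.
MIN_LENGTH = 100


def filter_ests(ests: List[Est], min_length: int = MIN_LENGTH) -> List[Est]:
    """Apply the three filtering steps in the order they are described."""
    kept = []
    for est in ests:
        if len(est.sequence) < min_length:   # step 1: remove short ESTs
            continue
        if est.contaminated:                 # step 2: remove contaminated ESTs
            continue
        if est.unigene_cluster is None:      # step 3: remove ESTs outside UniGene
            continue
        kept.append(est)
    return kept
```

Applied to the collected ESTs, a filter of this kind would yield the pool used for the downstream analysis; the accession and cluster identifiers passed in are whatever the upstream alignment step produced.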
The SN ESTs were annotated using the human RefSeq mRNA and the UniGene database (build #217) for similarity comparisons based on the UniGene clusters (shown in Supplementary Data, Table 1). Our SN ESTs were clustered into 2,951 UniGene clusters and assigned to 2,678 genes. Because we constructed the full-length cDNA libraries using the oligo-capping technique, these SN ESTs can serve as a resource for examining multiple transcription start sites by comparison with mRNA transcription start sites. To investigate amino acid changes, we compared the SN EST sequences to the RefSeq protein sequences using BLASTX.
To study the global expression of genes possibly associated with sporadic PD, which constitutes most PD cases, we counted the number of SN ESTs from PD patients and from normal tissues assigned to the same gene. The frequency of each gene was calculated by dividing the number of ESTs assigned to that gene by the total number of clones in each full-length cDNA library that merged into the UniGene database build #217. Abundantly expressed genes were selected and listed. Significant differences in gene expression between the datasets were calculated using the Audic and Claverie method. We analyzed the probability of differential expression between the normal full-length SN library and the PD full-length SN library at a cut-off probability of 0.9 (shown in Supplementary Data, Table 2). Finally, we found 57 up-regulated genes and 48 down-regulated genes by comparing EST frequencies between normal and PD SN. Among them, MBP has been reported to be up-regulated in PD SN. In terms of molecular function, the up-regulated genes were associated with structural constituents of the myelin sheath, cytokine activity, transcription regulator activity, GTPase activity, calcium ion binding, or RNA binding. The down-regulated genes were associated with oxidoreductase activity, serine-type endopeptidase inhibitor activity, phosphatidylethanolamine binding, mu-type opioid receptor binding, Rho GTPase activator activity, integrin binding, monooxygenase activity, or lipid binding.
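For readers unfamiliar with the Audic and Claverie method, the following is a minimal sketch of its core statistic: the probability of observing y ESTs for a gene in one library given x ESTs in the other, for library sizes N1 and N2. The function names are illustrative. The sketch returns a one-sided tail probability; how PDbase converts this into the "probability of differential expression" thresholded at 0.9 is not specified here, so no mapping to that cut-off is attempted.

```python
from math import exp, lgamma, log


def ac_pmf(y: int, x: int, n1: int, n2: int) -> float:
    """Audic-Claverie conditional probability p(y | x).

    p(y|x) = (N2/N1)^y * (x+y)! / (x! * y! * (1 + N2/N1)^(x+y+1))
    Computed in log space to avoid overflow in the factorials.
    """
    r = n2 / n1
    log_p = (
        y * log(r)
        + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
        - (x + y + 1) * log(1 + r)
    )
    return exp(log_p)


def ac_tail(x: int, y: int, n1: int, n2: int) -> float:
    """Smaller of the two one-sided tails, P(Y <= y | x) and P(Y >= y | x)."""
    lower = sum(ac_pmf(k, x, n1, n2) for k in range(y + 1))
    upper = 1.0 - sum(ac_pmf(k, x, n1, n2) for k in range(y))
    return min(lower, upper)


if __name__ == "__main__":
    # Illustrative call: 34 ESTs in a library of 2,883 versus 19 in one of 2,850.
    # Using the filtered pool sizes as library sizes is an assumption.
    print(ac_tail(x=34, y=19, n1=2883, n2=2850))
```

A small tail probability indicates that the two counts differ by more than sampling alone would suggest, given the sizes of the two libraries.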
To create consensus sequences, we mapped the SN ESTs, mRNAs, and UniGene EST clusters onto the human genome using BLAT and SIM4; only clusters containing at least one mRNA were used, in order to exclude pseudogenes. Consensus sequences were used to eliminate non-consensus features of each UniGene cluster after filtering out EST sequencing errors and contamination by a minority of similar but paralogous sequences. EST-mRNA alignments were then generated with the SIM4 program, producing a consensus sequence that excludes minority features such as unaligned ends and inserts due to chimeric sequences or unspliced introns. The matching genomic region was aligned with the complete set of ESTs and mRNAs for the UniGene cluster using BLAT and SIM4. The SN EST sequences were aligned to human genomic sequences with a 75% minimum score and 90% minimum identity. Where coordinates showed non-canonical splice sites, we confirmed the exon-intron junctions with SIM4, which aligns expressed sequences to genomic DNA efficiently and accurately while allowing for introns in the genomic sequence and a relatively small number of sequencing errors.
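The alignment thresholds above (75% minimum score, 90% minimum identity) amount to a simple acceptance filter on each EST-to-genome alignment. The sketch below assumes each alignment has already been summarized as two percentages; the `Alignment` record and the interpretation of "score" as a percentage are assumptions, not the output format of BLAT or SIM4.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical summary of one EST-to-genome alignment."""
    est_id: str
    score_pct: float     # assumed: alignment score expressed as a percentage
    identity_pct: float  # percent identity over the aligned region


MIN_SCORE_PCT = 75.0     # minimum score stated in the text
MIN_IDENTITY_PCT = 90.0  # minimum identity stated in the text


def passes_thresholds(aln: Alignment) -> bool:
    """True if the alignment meets both cut-offs."""
    return aln.score_pct >= MIN_SCORE_PCT and aln.identity_pct >= MIN_IDENTITY_PCT


def accepted(alignments: List[Alignment]) -> List[Alignment]:
    """Keep only alignments that meet both cut-offs."""
    return [aln for aln in alignments if passes_thresholds(aln)]
```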
Alternative splicing was detected by a computational procedure using genomic-EST-mRNA multiple sequence alignments. Alternative splicing types were derived from the resulting isoforms, which retain all possible alternative splicing information. SN ESTs with poor coverage were filtered out to remove non-consensus splice sites and poorly covered regions. We categorized alternative splicing events as alternative start, alternative end, alternative 5' exon, alternative 3' exon, exon skipping, mutually exclusive exons, or intron retention. Alternative starts and ends were identified if the first or last exon in a gene model was part of an alternative region. Alternative cassettes were labelled as such if a junction skipped one exon.
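The cassette rule described above (a junction that skips one exon) can be illustrated with a short sketch. It assumes a gene model given as sorted, non-overlapping exon coordinates and a junction given as the end of the donor exon and the start of the acceptor exon; this representation and the function names are hypothetical and are not taken from the PDbase pipeline.

```python
from typing import List, Tuple

Exon = Tuple[int, int]      # (start, end) genomic coordinates, start <= end
Junction = Tuple[int, int]  # (donor exon end, acceptor exon start)


def skipped_exons(junction: Junction, exons: List[Exon]) -> List[Exon]:
    """Return the exons of the gene model lying entirely inside the junction."""
    donor_end, acceptor_start = junction
    return [
        (start, end)
        for start, end in exons
        if start > donor_end and end < acceptor_start
    ]


def is_cassette(junction: Junction, exons: List[Exon]) -> bool:
    """A junction is labelled a cassette event if it skips exactly one exon."""
    return len(skipped_exons(junction, exons)) == 1
```

For a model with exons at (100, 200), (300, 400), (500, 600), and (700, 800), a junction from 200 to 500 skips the second exon and counts as a cassette event, whereas a junction from 200 to 300 joins adjacent exons and does not.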
PDbase contains transcripts representing several alternative splicing events (shown in Supplementary Data, Table 3). SN ESTs were associated with alternative splicing events in 321 genes. To identify candidate genes with PD-specific alternative splicing patterns, we compared the alternative splicing patterns of normal SN ESTs with those of ESTs from PD patients. We found thirty-five PD-specific candidate genes with alternative splicing events that were up-regulated in PD SN tissues, for example AQP1, DCXR, DKK3, EEF1A1, GNAS, PGK1, SUCLG1, and THTPA. The major alternative splicing events in genes up-regulated in PD SN tissues are alternative transcription start or end sites. This may reflect the construction of the SN full-length cDNA libraries by the oligo-capping method, which replaces the cap structure specific to the 5' end of eukaryotic mRNA with oligonucleotides.
To provide global PD-related gene features, we integrated PD-related gene information as well as knowledge-based information. Because public databases for SNPs and diseases are large, complicated, and difficult to use, their integration is challenging. Using our pipeline system for integrating disease-related genes and genetic variation, we collected 2,701 genes associated with PD and an average of 323 genetic variations per genomic region. The integrated information is based on human gene nomenclature (HGNC) and UniProt, genetic variation from dbSNP (version 129), and disease information from Online Mendelian Inheritance in Man (OMIM), the Human Gene Mutation Database (HGMD), and the Genetic Association Database (GAD). We examined the distribution of PD-related genes across several domains of molecular and cellular biology based on the Gene Ontology database. In addition, we surveyed the protein-protein interactions (PPI) of the PD-related genes in the Human Protein Reference Database (HPRD). It has been reported that the degeneration of dopaminergic neurons of the SNpc is driven by dysfunction of the mitochondrial complex through activation of mitochondria-dependent apoptotic molecular pathways. Hence, we investigated mitochondrial proteins associated with PD from MitoDat (Mendelian Inheritance and the Mitochondrion). We found 31 mitochondrial proteins located in the inner membrane (68%), outer membrane (19%), intermembrane space (6%), or matrix (6%). The numbers in parentheses indicate the percentage of these mitochondrial proteins located in each compartment. Examples include solute carrier family 25 (ANT1, ANT2, ANT3), ATP synthase, H+ transporting, mitochondrial F1 complex (ATP5A1, ATP5B, ATP5C1, ATP5F1), and kinesin heavy chain member (HK2).
We also investigated molecular and cellular signalling pathways associated with PD-related genes using the BioCarta models (http://www.biocarta.com/genes/index.asp) and the KEGG databases. To examine the RNA elements involved in the regulation of PD-related genes, we searched for microRNA elements related to PD genes in the miRBase database, as an experimental microRNA resource, and for conserved mammalian microRNA regulatory target sites for conserved microRNA families in the 3' UTRs of RefSeq genes, as predicted by TargetScanS in the UCSC table track. In addition, we used the multiple transcription start sites, CpG islands, and repeat elements from the UCSC table tracks (downloaded March 2009).
As an example of a PD-related gene search, we present the query results for SPP1, which encodes secreted phosphoprotein 1 (osteopontin, bone sialoprotein I, early T-lymphocyte activation 1). When a user queries SPP1, comprehensive information is shown, including the frequency difference between the two full-length libraries from PD and normal SN tissues. PDbase contains 19 SN PD ESTs and 34 SN normal ESTs for this gene. SPP1 was down-regulated at a statistically significant level in more than one sample, with a probability of 0.977 < p < 0.98. This gene is known to be strongly anti-apoptotic from a previous cell death activity study. In addition, the user can obtain general gene information and genetic information, including the SNP markers located in this gene region, repeat elements, and alternative splicing events from the PD and normal SN ESTs. Three microRNAs may be associated with regulation of this gene, which has experimentally confirmed protein-protein interactions with eighteen other proteins and belongs to the regulators of bone mineralization pathway. SPP1 was identified as a PD target gene through our human SN EST analysis and verified using RT-PCR and a mouse model treated with the neurotoxin 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP). PDbase query results thus bring together, in one place, information that is scattered across previously published databases.
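The direction of the SPP1 change can be checked with simple arithmetic on the EST counts. The sketch below assumes library sizes of 2,850 (PD) and 2,883 (normal), taken from the filtered EST pool; the denominators actually used by PDbase (clones per full-length library merged into UniGene) may differ, so the result is illustrative only.

```python
# EST counts for SPP1 reported in the text.
SPP1_PD_ESTS = 19
SPP1_NORMAL_ESTS = 34

# Assumed library sizes (filtered EST pool); PDbase may use different denominators.
PD_LIBRARY_SIZE = 2850
NORMAL_LIBRARY_SIZE = 2883


def relative_frequency(est_count: int, library_size: int) -> float:
    """Fraction of a library's ESTs assigned to one gene."""
    return est_count / library_size


pd_freq = relative_frequency(SPP1_PD_ESTS, PD_LIBRARY_SIZE)
normal_freq = relative_frequency(SPP1_NORMAL_ESTS, NORMAL_LIBRARY_SIZE)
pd_to_normal_ratio = pd_freq / normal_freq

if __name__ == "__main__":
    print(f"PD frequency:     {pd_freq:.5f}")
    print(f"Normal frequency: {normal_freq:.5f}")
    print(f"PD/normal ratio:  {pd_to_normal_ratio:.3f}")
```

Under these assumptions the PD-to-normal ratio is below 1, which is consistent with the reported down-regulation of SPP1 in PD SN.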
We constructed PDbase, a database of PD-related genes and genetic variation based on SN ESTs. PDbase contains 2,698 genes and their biological characteristics, obtained in two ways: 1) through 303 cDNA libraries from human normal and PD SN tissues and 2) by integrating information on disease-related genes and genetic variation. Mitochondrial DNA variants play various roles in PD, and mitochondrial dysfunction has been reported as a cause of neurodegenerative diseases. Thus, PDbase also provides PD-related mitochondrial proteins, microRNAs, single nucleotide polymorphism (SNP) markers within PD-related gene structures, repeat elements, and pathways and networks with protein-protein interaction information. PDbase integrates not only public resources but also previously unreported PD target genes discovered from normal and PD SN ESTs. These genes can serve as specific biomarkers for PD or other neurodegenerative diseases and as targets for novel drug development. PDbase can also provide insight into the pathogenesis of PD and help identify molecular targets of potential therapeutic significance for neurodegeneration.
Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.
This research was supported by a grant from KRIBB Research Initiative Program, the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. M10869030002-08N6903-00210), and a grant of the MOST 21C Frontier R & D program in neuroscience from the Ministry of Science & Technology of Korea. We thank Maryana Bhak for editing the manuscript and Ha-Na Byun for web image design.
This article has been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.