GeneLink: a database to facilitate genetic studies of complex traits
© Gillanders et al; licensee BioMed Central Ltd. 2004
Received: 12 July 2004
Accepted: 18 October 2004
Published: 18 October 2004
Compared with gene-mapping studies of simple Mendelian disorders, genetic analyses of complex traits are far more challenging, and high-quality data management systems are often critical to the success of these projects. To minimize the difficulties inherent in complex trait studies, we have developed GeneLink, a Web-accessible, password-protected Sybase database.
GeneLink is a powerful tool for complex trait mapping, enabling genotypic data to be easily merged with pedigree and extensive phenotypic data. Specifically designed to facilitate large-scale (multi-center) genetic linkage or association studies, GeneLink securely and efficiently handles large amounts of data and provides additional features that facilitate quality control and data analysis by existing software packages. These include the ability to download chromosome-specific data files containing marker data in map order in various formats appropriate for downstream analyses (e.g., GAS and LINKAGE). Furthermore, an unlimited number of phenotypes (either qualitative or quantitative) can be stored and analyzed. Finally, GeneLink generates several quality assurance reports, including genotyping success rates of specified DNA samples and success and heterozygosity rates for specified markers.
GeneLink has already proven an invaluable tool for complex trait mapping studies and is discussed primarily in the context of our large, multi-center study of hereditary prostate cancer (HPC). GeneLink is freely available at http://research.nhgri.nih.gov/genelink.
In the past decade, hundreds of genes involved in the etiology of simple Mendelian disorders such as cystic fibrosis and Huntington's disease have been identified [1–3]. The genetic localization of these disorders, primarily through positional cloning approaches, has been highly successful because of the relatively simple model underlying disease pathogenesis. In the majority of these cases, a single mutated disease gene is both necessary and sufficient to cause the observed trait. In contrast, susceptibility to complex traits is heterogeneous, involving both multiple genetic and environmental risk factors, acting either independently or together.
Efforts to identify susceptibility genes involved in complex traits such as cancer, diabetes, hypertension, or Alzheimer's disease are complicated by genetic heterogeneity, incomplete penetrance, phenocopies, and a late age of disease onset (so that DNA samples from the parents of affected individuals are often unavailable). Each of these factors significantly reduces the power of any given study. Therefore, gene-mapping studies of complex traits require high-throughput genotyping performed on large collections of DNA samples using hundreds to thousands of polymorphic markers. The significant amounts of data generated during these genome surveys pose numerous data management challenges. To address these challenges, which are inherent in any large, collaborative genotyping study, we have developed a robust, easy-to-use database system named GeneLink.
GeneLink was initially developed to facilitate our studies of genetic susceptibility to prostate cancer, which aim to identify novel high- and moderate-penetrance genes involved in hereditary prostate cancer risk. These studies are multi-center collaborative efforts involving researchers from the United States, Finland, and Sweden [4–7]. The project included 496 families containing 5,247 individuals; DNA from 2,374 of these individuals was available for genotyping. We genotyped over 400 microsatellite markers for these individuals, generating close to one million genotypes. This number is large but not atypical in gene-mapping studies of complex traits. Given the considerable number of genotypes requiring analysis, it was clear that we needed a database management system that could handle such large quantities of data and address data management issues unique to complex trait genetic analysis.
Currently, GeneLink's database uses Sybase SQL server ASE 12.5.1, running on a Sun V880 computer under Unix. The GeneLink Web scripts that access the database require Perl version 5.6.1 or greater and the Perl modules DBI, DBD::Sybase, CGI, and Carp. CGI and Carp ship with standard Perl 5 releases such as 5.6.1, whereas DBI and DBD::Sybase must be installed separately from CPAN. A Web server such as Apache 1.3.29 is also required to run the GeneLink Web scripts. GeneLink can operate on a Sun Enterprise 6500 or similar machine configured to operate as a Web server.
The Markers table stores information regarding all markers typed in a given project, including the panel in which the marker was run. A panel is defined as a group of microsatellite or SNP markers that can be electrophoresed simultaneously by taking advantage of different fluorescent dye labels and varying amplicon sizes. Also stored in the Markers table are the allele size range (ASR) of the marker, the fluorescent dye used to label the forward primer, and the marker-specific genotype for a CEPH control individual (e.g., CEPH 1347-02 was used in our prostate cancer study). The Primers table provides additional information for each specific marker. This information includes UniSTS ID, GenBank accession number, forward primer sequence, reverse primer sequence, and primer purchasing and inventory information. The Primers table also contains comprehensive genetic map information (deCODE, Généthon, and Marshfield positions) as well as physical location (Build, physical start position, and physical stop position). The Maps table stores the genetic map location of a marker in the genome, as well as the relative order of markers along a chromosome and the distance between adjacent markers. The markers typed thus far in our HPC study are di-, tri-, or tetranucleotide repeats; however, GeneLink is capable of handling any combination of microsatellite and single nucleotide polymorphism (SNP) data.
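The role of the Maps table (marker order along a chromosome and the distance between adjacent markers) can be illustrated with a minimal Python sketch. This is not GeneLink's schema or code, which is implemented in Perl against Sybase. The `MapEntry` fields, the function name, and the marker names and positions are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapEntry:
    """One row of a hypothetical Maps-like table (field names are assumptions)."""
    marker: str
    chromosome: int
    position_cm: float  # genetic map position in centimorgans


def markers_in_map_order(entries, chromosome):
    """Return (marker, position, distance_to_previous) tuples for one chromosome.

    Markers are sorted by genetic position; the first marker on the
    chromosome has no predecessor, so its distance is None.
    """
    on_chromosome = sorted(
        (e for e in entries if e.chromosome == chromosome),
        key=lambda e: e.position_cm,
    )
    ordered = []
    previous = None
    for entry in on_chromosome:
        if previous is None:
            gap = None
        else:
            gap = round(entry.position_cm - previous.position_cm, 3)
        ordered.append((entry.marker, entry.position_cm, gap))
        previous = entry
    return ordered


# Made-up markers and positions, for illustration only.
EXAMPLE_MAP = [
    MapEntry("MARKER_B", 1, 14.0),
    MapEntry("MARKER_A", 1, 4.5),
    MapEntry("MARKER_C", 1, 20.5),
    MapEntry("MARKER_X", 2, 7.0),
]
```

Calling `markers_in_map_order(EXAMPLE_MAP, 1)` lists the three chromosome 1 markers by increasing position, which is the ordering a chromosome-specific export file would need.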
Each of GeneLink's 11 tables has built-in quality control measures. First, all changes (import, modification, or deletion) to GeneLink records are stamped with the date, time, and user ID of the individual doing the editing. Changes to the Families, Pedigrees, or Genotypes tables can be easily reviewed in a Histories table. Second, the contents of every field are verified on import, and users are warned of any failures, such as invalid formats or duplicate records. Within the Pedigrees table specifically, checks confirm that all individuals designated as parents are present within their families, that all fathers are male, and that all mothers are female. In addition, when genotypes are imported into the Genotypes table, GeneLink confirms that each individual included in the import is designated as having DNA in the Pedigrees table. GeneLink then checks that each allele falls within the marker's designated allele size range (Markers table) and that the genotype for the control individual (e.g., 1347-02) matches what is expected (Markers table).
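The import checks described above can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not GeneLink's actual Perl/Sybase implementation. The dictionary layouts, key names (`father`, `mother`, `sex`, `has_dna`, `asr`, `control_genotype`), and function names are hypothetical.

```python
def check_pedigree(pedigree):
    """Check one family's pedigree records.

    pedigree maps person ID -> dict with keys 'father', 'mother' (parent IDs,
    or None for founders) and 'sex' ('M' or 'F'). Returns a list of error
    messages; an empty list means every check passed.
    """
    errors = []
    for person_id, person in pedigree.items():
        for role, expected_sex in (("father", "M"), ("mother", "F")):
            parent_id = person.get(role)
            if parent_id is None:
                continue  # founder: no parent recorded for this role
            parent = pedigree.get(parent_id)
            if parent is None:
                errors.append(f"{person_id}: {role} {parent_id} is not in the family")
            elif parent["sex"] != expected_sex:
                errors.append(f"{person_id}: {role} {parent_id} has sex {parent['sex']}")
    return errors


def check_genotype(genotype, marker, pedigree, control_id):
    """Check one imported genotype against marker and pedigree records.

    genotype: dict with 'person' and 'alleles' (two allele sizes).
    marker: dict with 'asr' (min_size, max_size) and 'control_genotype'.
    """
    errors = []
    person = genotype["person"]
    alleles = genotype["alleles"]
    if person == control_id:
        # The control individual must reproduce its expected genotype.
        if sorted(alleles) != sorted(marker["control_genotype"]):
            errors.append(f"control {person}: unexpected genotype {alleles}")
    elif not pedigree.get(person, {}).get("has_dna", False):
        errors.append(f"{person}: not designated as having DNA")
    low, high = marker["asr"]
    for allele in alleles:
        if not low <= allele <= high:
            errors.append(f"{person}: allele {allele} outside range {low}-{high}")
    return errors
```

Returning a list of messages rather than raising on the first failure mirrors the described behaviour of warning users about every problem found in an import.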
Results and discussion
When faced with the challenge of studying 496 hereditary prostate cancer families and a total of 5,247 individuals, we sought a publicly available database management system capable of handling the unique challenges that accompany a large-scale, multi-center genetic linkage study of a complex trait. Although data management systems have been developed [8–12], none could securely and efficiently handle a very large amount of data, as well as provide additional features to facilitate quality control and analysis of data generated. Therefore, we developed GeneLink, a database with unique features, to address these needs.
We designed GeneLink to use a Sybase database backend to take advantage of Sybase's ability to process large amounts of data. To our knowledge, GeneLink is currently the only publicly available freeware database capable of efficiently storing millions of genotypes. The need for efficient data management will grow in importance as researchers explore genome-wide SNP association studies that may generate close to one billion genotypes (500 cases, 500 controls, and 500,000 to 1,000,000 SNPs). We are currently updating GeneLink so it can run using either Sybase or Oracle. Furthermore, GeneLink was designed to avoid database-specific code and should therefore be portable to open-source database engines, such as PostgreSQL, without much difficulty.
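The genotype count quoted for a genome-wide SNP association study follows from a simple product. The worked calculation below assumes that every individual is typed at every SNP, with no missing data; the symbols are introduced here for illustration.

```latex
\[
N_{\text{genotypes}} = \left(n_{\text{cases}} + n_{\text{controls}}\right) \times n_{\text{SNPs}}
\]
\[
(500 + 500) \times 500{,}000 = 5 \times 10^{8},
\qquad
(500 + 500) \times 1{,}000{,}000 = 1 \times 10^{9}
\]
```

The study design in the text therefore yields between half a billion and one billion genotypes, depending on the number of SNPs typed.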
To collect the number of DNA samples needed to provide sufficient power to detect linkage or association, collaborative efforts are almost always required. The Web-based interface of GeneLink facilitates multi-center collaborations, as data can easily be accessed via the Internet. GeneLink's Web-based interface also makes it platform-independent, a feature that was essential given the number of researchers who would be accessing it using various hardware-browser combinations. Other publicly available databases described in the literature do not have this advantage. In this paper, we have presented GeneLink in the context of a collaborative effort in which multiple sites need access to data generated in a single laboratory. However, GeneLink would also be valuable in the context of a meta-analysis of data generated in more than one laboratory. Making data access easier for our collaborators created the need for a sophisticated security system. Specifically, in our study of hereditary prostate cancer, researchers are permitted access to only their own set of data. This is important because, in some cases, a site's institutional review board (IRB) protocol may not allow raw data to be shared with other analysts.
Another challenge of complex trait linkage or association studies is formatting data appropriately for analysis by existing software packages. Chromosome-specific LINKAGE, GAS, and RelCheck format files can easily be exported by GeneLink. By design, GeneLink's exporting capabilities also provide several additional advantages. First, GeneLink is capable of exporting multiple traits at the same time, thus facilitating analyses in which covariate information will be included. Second, by taking advantage of GeneLink's ability to generate liability classes defined by age, sex, and affection status, researchers can maximize power in the investigation of complex traits, which often exhibit reduced penetrance and phenocopies. Third, GeneLink's Allele Translation table allows comparison of alleles across families or across analyses, as each allele for each polymorphic marker is recoded only once. This is particularly important as linkage disequilibrium or association studies become more common. Fourth, GeneLink's ability to export only a subset of families is critical, as genetic heterogeneity is a significant factor contributing to the difficulty of mapping genes involved in many complex traits. Multiple genes (RNASEL, ELAC2, and MSR1, among others [14–16]) have been implicated in hereditary prostate cancer susceptibility, suggesting that genetic heterogeneity is likely to be a complicating factor in the gene mapping of HPC risk alleles regardless of the analysis method. Finally, GeneLink maintains a list of previously exported files, which eliminates redundant generation of data files by collaborators and functions as an archive of data files used for analyses.
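The idea behind the Allele Translation table, that each allele of each marker is recoded exactly once so that codes stay comparable across families and exports, can be sketched as follows. This is a hypothetical Python illustration, not GeneLink's implementation. The class and method names are assumptions, codes are assigned in order of first appearance (GeneLink may use a different rule), and 0 is used for a missing allele, following the LINKAGE file convention.

```python
class AlleleTranslation:
    """Assign each (marker, allele size) pair a stable integer code, once."""

    MISSING_CODE = 0  # LINKAGE-format convention for an unknown allele

    def __init__(self):
        self._codes = {}  # marker name -> {allele size: integer code}

    def code_for(self, marker, allele_size):
        """Return the code for an allele, creating it on first use."""
        if allele_size is None:
            return self.MISSING_CODE
        marker_codes = self._codes.setdefault(marker, {})
        if allele_size not in marker_codes:
            # Next unused code for this marker; existing codes never change.
            marker_codes[allele_size] = len(marker_codes) + 1
        return marker_codes[allele_size]

    def recode(self, marker, genotype):
        """Translate a pair of allele sizes into a pair of integer codes."""
        return tuple(self.code_for(marker, allele) for allele in genotype)


# Made-up marker names and allele sizes, for illustration only.
translation = AlleleTranslation()
family_one = translation.recode("MARKER_A", (152, 156))
family_two = translation.recode("MARKER_A", (156, 160))
```

Because the mapping is stored and never reassigned, allele 156 receives the same code in both families, which is what makes allele sharing comparable across exports.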
Additional quality control measures were included in GeneLink's design. First, all changes to the database are recorded. As genetic studies of complex traits can be spread over many years, it is important to keep a detailed log of any changes made to the data. For example, an individual's affection status may change during the course of a study; therefore it is critical to track when this information was updated in the database. Second, in order to monitor data quality, GeneLink was also designed to perform several built-in checks, as described above.
GeneLink was designed primarily in the context of family-based studies of complex traits. It is capable of handling both linkage and association data, and can be used for whole genome scans, candidate-gene studies, or both. Further development of GeneLink will focus on extending its capabilities for the case-based design. We recognize that the family-based and case-based study designs each have unique advantages, so we see it as critical to make GeneLink flexible enough to accommodate a case-based design. Currently, there is no limitation on storing case-based data; however, GeneLink's exporting mechanisms will need to be changed to support this design. Finally, in the same way that GeneLink is capable of storing "exported" data input files, future work will center on the storage of analysis results. Again, this would be helpful for multi-center collaborative studies, which will continue to be critical to successful efforts to identify genes important in complex trait etiology.
In summary, GeneLink was designed specifically to ease the data management burden of mapping complex traits. It provides many functions that make it a uniquely powerful tool for use in genetic linkage or association studies. GeneLink simplifies merging genotypic data with pedigree, phenotype, and genetic or physical map information. Specifically, GeneLink's design makes it ideal for large-scale, multi-center studies, which will become more and more common in efforts to dissect the genetic factors contributing to complex trait etiology.
Availability and requirements
Project name: GeneLink
Project home page: http://research.nhgri.nih.gov/genelink
Operating system(s): Platform independent
Programming language: Perl
Other requirements: Sybase SQL server ASE 12.5.1, Perl version 5.6.1 or greater, CPAN Perl modules DBI, DBD::Sybase, CGI, and Carp, Web server such as Apache 1.3.29
License: Sybase SQL server ASE 12.5.1
Any restrictions to use by non-academics: none
EG, JT, JEBW and AB participated in the database's design. AM, LU, KT and TW did all of the programming. EG, DG, PD, APK, MJ, DFL, and GI performed extensive testing of the database. EG, PD, TW, JEBW and AB drafted the manuscript. All authors read and approved the final manuscript.
The authors would like to thank Dr. Scott Diehl for early discussions of data management needs and database design.
- Rommens JM, Iannuzzi MC, Kerem B, Drumm ML, Melmer G, Dean M, Rozmahel R, Cole JL, Kennedy D, Hidaka N, Zsiga M, Buchwald M, Riordan J, Tsui L, Collins FS: Identification of the cystic fibrosis gene: chromosome walking and jumping. Science. 1989, 245 (4922): 1059-1065.
- Riordan JR, Rommens JM, Kerem B, Alon N, Rozmahel R, Grzelczak Z, Zielenski J, Lok S, Plavsic N, Chou JL, Drumm ML, Iannuzzi MC, Collins FS: Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science. 1989, 245 (4922): 1066-1073.
- The Huntington's Disease Collaborative Research Group: A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell. 1993, 72 (6): 971-983. 10.1016/0092-8674(93)90585-E.
- Lange EM, Gillanders EM, Davis CC, Brown WM, Campbell JK, Jones M, Gildea D, Riedesel E, Albertus J, Freas-Lutz D, Markey C, Giri V, Dimmer JB, Montie JE, Trent JM, Cooney KA: Genome-wide scan for prostate cancer susceptibility genes using families from the University of Michigan prostate cancer genetics project finds evidence for linkage on chromosome 17 near BRCA1. Prostate. 2003, 57 (4): 326-334. 10.1002/pros.10307.
- Schleutker J, Baffoe-Bonnie AB, Gillanders E, Kainu T, Jones MP, Freas-Lutz D, Markey C, Gildea D, Riedesel E, Albertus J, Gibbs KD, Matikainen M, Koivisto PA, Tammela T, Bailey-Wilson JE, Trent JM, Kallioniemi OP: Genome-wide scan for linkage in Finnish hereditary prostate cancer (HPC) families identifies novel susceptibility loci at 11q14 and 3p25-26. Prostate. 2003, 57 (4): 280-289. 10.1002/pros.10302.
- Wiklund F, Gillanders EM, Albertus JA, Bergh A, Damber JE, Emanuelsson M, Freas-Lutz DL, Gildea DE, Goransson I, Jones MS, Jonsson BA, Lindmark F, Markey CJ, Riedesel EL, Stenman E, Trent JM, Gronberg H: Genome-wide scan of Swedish families with hereditary prostate cancer: suggestive evidence of linkage at 5q11.2 and 19p13.3. Prostate. 2003, 57 (4): 290-297. 10.1002/pros.10303.
- Xu J, Gillanders EM, Isaacs SD, Chang BL, Wiley KE, Zheng SL, Jones M, Gildea D, Riedesel E, Albertus J, Freas-Lutz D, Markey C, Meyers DA, Walsh PC, Trent JM, Isaacs WB: Genome-wide scan for prostate cancer susceptibility genes in the Johns Hopkins hereditary prostate cancer families. Prostate. 2003, 57 (4): 320-325. 10.1002/pros.10306.
- Seuchter SA, Skolnick MH: HGDBMS: a human genetics database management system. Comput Biomed Res. 1988, 21 (5): 478-487. 10.1016/0010-4809(88)90006-7.
- Adams P: LABMAN and LINKMAN: a data management system specifically designed for genome searches of complex diseases. Genet Epidemiol. 1994, 11 (1): 87-98.
- Cheung KH, Nadkarni P, Silverstein S, Kidd JR, Pakstis AJ, Miller P, Kidd KK: PhenoDB: an integrated client/server database for linkage and population genetics. Comput Biomed Res. 1996, 29 (4): 327-337. 10.1006/cbmr.1996.0024.
- McMahon FJ, Thomas CJ, Koskela RJ, Breschel TS, Hightower TC, Rohrer N, Savino C, McInnis MG, Simpson SG, DePaulo JR: Integrating clinical and laboratory data in genetic studies of complex phenotypes: a network-based data management system. Am J Med Genet. 1998, 81 (3): 248-256. 10.1002/(SICI)1096-8628(19980508)81:3<248::AID-AJMG9>3.3.CO;2-X.
- Li JL, Deng H, Lai DB, Xu F, Chen J, Gao G, Recker RR, Deng HW: Toward high-throughput genotyping: dynamic and automatic software for manipulating large-scale genotype data using fluorescently labeled dinucleotide markers. Genome Res. 2001, 11 (7): 1304-1314. 10.1101/gr.159701.
- Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-237. 10.1038/ng1090.
- Carpten J, Nupponen N, Isaacs S, Sood R, Robbins C, Xu J, Faruque M, Moses T, Ewing C, Gillanders E, Hu P, Bujnovszky P, Makalowska I, Baffoe-Bonnie A, Faith D, Smith J, Stephan D, Wiley K, Brownstein M, Gildea D, Kelly B, Jenkins R, Hostetter G, Matikainen M, Schleutker J, Klinger K, Connors T, Xiang Y, Wang Z, De Marzo A, Papadopoulos N, Kallioniemi OP, Burk R, Meyers D, Gronberg H, Meltzer P, Silverman R, Bailey-Wilson J, Walsh P, Isaacs W, Trent J: Germline mutations in the ribonuclease L gene in families showing linkage with HPC1. Nat Genet. 2002, 30 (2): 181-184. 10.1038/ng823.
- Tavtigian SV, Simard J, Teng DH, Abtin V, Baumgard M, Beck A, Camp NJ, Carillo AR, Chen Y, Dayananth P, Desrochers M, Dumont M, Farnham JM, Frank D, Frye C, Ghaffari S, Gupte JS, Hu R, Iliev D, Janecki T, Kort EN, Laity KE, Leavitt A, Leblanc G, McArthur-Morrison J, Pederson A, Penn B, Peterson KT, Reid JE, Richards S, Schroeder M, Smith R, Snyder SC, Swedlund B, Swensen J, Thomas A, Tranchant M, Woodland AM, Labrie F, Skolnick MH, Neuhausen S, Rommens J, Cannon-Albright LA: A candidate prostate cancer susceptibility gene at chromosome 17p. Nat Genet. 2001, 27 (2): 172-180. 10.1038/84808.
- Xu J, Zheng SL, Komiya A, Mychaleckyj JC, Isaacs SD, Hu JJ, Sterling D, Lange EM, Hawkins GA, Turner A, Ewing CM, Faith DA, Johnson JR, Suzuki H, Bujnovszky P, Wiley KE, DeMarzo AM, Bova GS, Chang B, Hall MC, McCullough DL, Partin AW, Kassabian VS, Carpten JD, Bailey-Wilson JE, Trent JM, Ohar J, Bleecker ER, Walsh PC, Isaacs WB, Meyers DA: Germline mutations and sequence variants of the macrophage scavenger receptor 1 gene are associated with prostate cancer risk. Nat Genet. 2002, 32 (2): 321-325. 10.1038/ng994.
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.