An integrated web medicinal materials DNA database: MMDBD (Medicinal Materials DNA Barcode Database)

Background Thousands of plants and animals possess pharmacological properties, and there is increasing interest in using these materials for therapy and health maintenance. The efficacy of such applications depends critically on the use of genuine materials. From time to time, life-threatening poisoning occurs because a toxic adulterant or substitute is administered. DNA barcoding provides a definitive means of authentication and of conducting molecular systematics studies. Owing to the reduced cost of DNA authentication, the volume of DNA barcodes produced for medicinal materials is on the rise, which necessitates the development of an integrated DNA database. Description We have developed an integrated DNA barcode multimedia information platform, the Medicinal Materials DNA Barcode Database (MMDBD), for data retrieval and similarity search. MMDBD contains over 1000 species of medicinal materials listed in the Chinese Pharmacopoeia and American Herbal Pharmacopoeia. MMDBD also contains useful information on each medicinal material, including resources, adulterant information, medicinal parts, photographs, primers used for obtaining the barcodes and key references. MMDBD can be accessed at http://www.cuhk.edu.hk/icm/mmdbd.htm. Conclusions This work provides a centralized medicinal materials DNA barcode database and bioinformatics tools for data storage, analysis and exchange, promoting the identification of medicinal materials. MMDBD has the largest collection of DNA barcodes of medicinal materials and is a useful resource for researchers in conservation, systematic study, forensics and the herbal industry.


Background
Herbal medicine is an ancient form of pharmaceutics that is still used by many cultures for curing diseases. With a trend of living in harmony with Nature, the use of herbal materials for treatment and health maintenance is on the rise. At present, many plant, fungal and animal species are used for treating diseases, and the Chinese Pharmacopoeia lists 670 commonly used species [1]. Herbal medicine is a useful source of bioactive compounds, such as oils obtained from the evening primrose (Oenothera biennis) for treating atopic dermatitis [2], and hyperforin extracted from St. John's Wort (Hypericum perforatum) as an antidepressant drug [3]. Nevertheless, their efficacy is critically dependent on the use of the correct material. If toxic adulterants or substitutes are administered, life-threatening poisoning may occur. In 2002, 63 people were reported with symptoms of general malaise, nausea and vomiting after consumption of herbal tea that had been inadvertently mixed with neurotoxic Japanese star anise (Illicium anisatum) [4]. Adulteration resulting in an epidemic of severe kidney damage caused by aristolochic acid was first reported in Belgium in 1993 [5], followed by Hong Kong and Korea [6,7] in 2004. In these cases, the herbs concerned were substituted with nephrotoxic Aristolochia species. A case of misusing Datura metel as Rhododendron molle was reported in Singapore in 2008 [8]. These two species share the same Chinese herb name "Naoyanghua", but D. metel contains anticholinergic compounds that cause confusion, dilated pupils, and absence of sweating. Traditionally, medicinal materials are identified by their organoleptic characteristics and physical properties such as shape, color, texture, and odor. However, the differences among related species or processed products are sometimes not obvious.
Unique chemicals may serve as important markers for authentication, but chemical markers or profiles may be affected by the physiological and storage conditions.
With the advancement of molecular technology, DNA markers have now become a convenient means for species identification and molecular systematic study [9][10][11], and many DNA markers have in fact been patented for further development [12]. With the help of polymerase chain reaction (PCR), specific DNA regions can be amplified from only a small amount of sample. An unequivocal identification of a tested sample can be reached by comparing its DNA sequences against the sequence of an authentic sample. To develop a universal identification platform, the Consortium for the Barcode of Life (CBOL) proposed to set up a standardized sampling method and experimental protocol to analyze agreed-upon 'DNA barcodes' [13].
A DNA barcode is a short DNA sequence of an organism that can be used to distinguish it from other species. Mitochondrial cytochrome c oxidase subunit 1 (COI) was chosen as the standard for all groups of higher animals [14,15]. For plant species, COI is not a suitable barcode because it evolves much more slowly in plants than in animals. Plant researchers examined several coding and non-coding regions, but they soon realized that a single DNA locus has limited resolving power for closely related species [16,17]. Although more laborious than the single-locus approach, combining two or more barcodes is generally agreed to increase the success rate [18,19]. Recently, members of the CBOL plant working group evaluated seven chloroplast genes and proposed matK and rbcL as plant barcodes, based on the following criteria: ease of amplification with a single primer pair, amenability to bidirectional sequencing with little manual editing, and high resolving power in species discrimination [20]. rbcL offers high universality and good discriminating power, whereas matK offers higher resolution. Nevertheless, the differentiation power of these two markers may not be high in closely related plant species [21]. In addition, experience from our group and other researchers has shown that chloroplast regions, including the trnH-psbA spacer and trnL-F, and nuclear regions such as the internal transcribed spacer (ITS) and the 5S rRNA intergenic spacer are also useful for authentication [22][23][24].
The Barcode of Life Data System (BOLD) is an online informatics workbench for the management, analysis and use of DNA barcodes [25], which is managed by the Canadian Centre for DNA Barcoding, University of Guelph. This system uses COI and the internal transcribed spacer (ITS) for animal and fungal identification, respectively. For plant species identification, the two-locus combination of matK and rbcL is the default barcode. The system also allows unknown sequences provided by users to be identified to species level.
Besides BOLD, some web-based barcode databases have been constructed to serve specific groups of organisms. UNITE is an rDNA sequence database that contains 2842 ITS sequences from 1105 species of 152 genera of ectomycorrhizal fungi [26]. The main target of UNITE is to facilitate the identification of environmental samples of fungal DNA. In addition to similarity searches, UNITE has built-in maximum parsimony heuristic and neighbor joining phylogenetic tools for online analysis. All Leps Barcode of Life is a database for the identification and discovery of Lepidoptera. This database now has 448,054 barcodes from 40,907 species [27]. Because insects display different morphological characters throughout their life cycle, DNA barcodes provide an accurate tool for identifying the species. The Fish Barcode of Life Initiative (FISH-BOL) addresses the identification and natural history of various fish species through the use of COI sequences [28]. This database contains DNA barcodes, images and geospatial coordinates of examined specimens. For plant species, the Genome Database for Rosaceae (GDR) is an integrated web-based relational database that contains genetic markers and ESTs of the Rosaceae [29,30].
DNA barcodes are gaining popularity for authenticating medicinal materials [31][32][33][34][35]. Along with this trend, the 2010 edition of the Chinese Pharmacopoeia has included protocols for DNA extraction and DNA barcodes for selected medicinal materials. As a timely action, we set out to establish the Medicinal Materials DNA Barcode Database (MMDBD) for recording the DNA barcode sequences, basic information and key references of medicinal materials. The aims of this database are: (1) to develop an organized and integrated web resource of DNA barcodes for medicinal species identification, (2) to collect and integrate the basic information of medicinal materials and their DNA barcodes, and (3) to develop online tools and resources for sequence comparison. In this paper, we describe the structure and content of the database and present its access utilities and tools.

Medicinal materials information
Medicinal species listed in the Pharmacopoeia of the People's Republic of China [1] and the American Herbal Pharmacopoeia [36], and species used in prescriptions of folk medicine, were chosen for inclusion in the database. Substitutes, adulterants and closely related species were also included for comparison. Currently, a total of 1259 species with 18,436 sequences are available in this database. The scientific names, medicinal names, general information, and the classification of the materials in the Convention on International Trade in Endangered Species (CITES) [37] were collected. The voucher numbers of the samples, primer sequences and PCR conditions for generating barcode sequences are also provided as additional information. Photographs of live specimens or dried medicinal parts of the medicinal materials were captured digitally. Live specimen images were mostly taken at the Chinese herbal garden of The Chinese University of Hong Kong (CUHK). Dried medicinal materials were provided by the Chinese Medicine Museum of the Institute of Chinese Medicine, CUHK. All images are of high resolution (1200 × 800 pixels).

Sequence information

DNA sequences selected for MMDBD
Medicinal materials include plant, animal, insect and fungal species. Considering the large number of species in the world, it is widely believed that a single set of barcodes should be adopted. Since CBOL has already chosen several DNA sequences as barcodes, researchers will undoubtedly focus on these standard DNA regions in the future. Our primary goal is to include these DNA barcodes of medicinal materials in MMDBD. On the other hand, many studies point out that other DNA regions are useful in species identification, and that combining them with the standard barcodes increases the accuracy of differentiating closely related species. We therefore also include these "supplementary barcodes" in MMDBD for reference. COI has been proposed as the barcode for higher animals in CBOL and BOLD [13,25]. Other mitochondrial regions, such as cytochrome b, 12S rRNA and 16S rRNA, have also proven useful in animal and insect species identification [38][39][40][41]. Consequently, all mitochondrial DNA sequences were included for medicinal materials originating from higher animals and insects. ITS sequences have high differentiation power for the identification of fungal species [42,43], and accordingly we selected this region for fungal medicines. For plant materials, in addition to chloroplast DNA, nuclear DNA regions including ITS and the 5S rRNA intergenic spacer are able to discriminate closely related plant species [23,44,45]. Also, combining two or more DNA regions may be necessary for some species. Hence, all available chloroplast DNA sequences and two nuclear spacers, ITS and the 5S rRNA intergenic spacer, are included in the database. MMDBD therefore consists of barcodes proposed by CBOL and other DNA regions useful for the identification of medicinal materials (Table 1).

DNA sequences extracted from public databases
INSDSeq Extensible Markup Language (XML) files that contained DNA sequences and related information were downloaded from GenBank. Scripts were used to extract and filter sequence data from the XML files, so that irrelevant sequences such as microsatellite or mRNA sequences were excluded. Because some regions lack standardized names (the ITS region, for example, may be annotated as 'ITS', 'ITS1', 'ITS-1' or 'ITS-2'), multiple alternative keywords were used to retain these regions in our dataset. All the extracted data were then imported into the MySQL database.
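
The extraction and filtering step can be illustrated with a minimal Python sketch. It is not the authors' script: the keyword lists, the function name `extract_its_records` and the choice to match on the record definition line are assumptions made for illustration. The element names (`INSDSeq`, `INSDSeq_definition`, `INSDSeq_moltype`, `INSDSeq_primary-accession`, `INSDSeq_sequence`) follow the INSDSeq XML format.

```python
import re
import xml.etree.ElementTree as ET

# Assumed, illustrative keyword lists; the lists actually used for MMDBD are not given.
ITS_SYMBOL = re.compile(r"\bITS[- ]?[12]?\b")          # ITS, ITS1, ITS-1, ITS-2, ...
ITS_FULL_NAME = "internal transcribed spacer"
EXCLUDE = re.compile(r"microsatellite|mRNA", re.IGNORECASE)


def extract_its_records(xml_text):
    """Return ITS records from an INSDSeq XML string, skipping mRNA and microsatellites."""
    root = ET.fromstring(xml_text)
    kept = []
    for seq in root.iter("INSDSeq"):
        definition = seq.findtext("INSDSeq_definition", default="")
        moltype = seq.findtext("INSDSeq_moltype", default="")
        # Exclude irrelevant records by molecule type or by keyword in the definition.
        if moltype.upper() == "MRNA" or EXCLUDE.search(definition):
            continue
        # Keep the record if any alternative spelling of the region name is present.
        is_its = ITS_SYMBOL.search(definition) or ITS_FULL_NAME in definition.lower()
        if not is_its:
            continue
        kept.append({
            "accession": seq.findtext("INSDSeq_primary-accession", default=""),
            "definition": definition,
            "sequence": seq.findtext("INSDSeq_sequence", default="").upper(),
        })
    return kept
```

Matching the symbol case-sensitively avoids false hits on the English word "its"; the resulting dictionaries could then be written to the database with any client library.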

DNA sequences generated by our group
Our group has generated 531 DNA sequences for 189 medicinal materials, covering the ITS, 5S rRNA intergenic spacer, chloroplast trnL, trnL-F and mitochondrial cytochrome b regions. Samples and specimens were collected from various localities and authenticated by experts in pharmacognosy. Total DNA was extracted and selected DNA regions were amplified by polymerase chain reaction (PCR) using universal primers [46]. The PCR products were sequenced directly or, in the case of unresolved sequences, were first cloned and then sequenced. Primer flanking sites on the determined DNA sequences were removed.
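
Removal of primer flanking sites can be sketched as follows. This is a simplified illustration rather than the procedure actually used: it assumes exact, non-degenerate primer matches at the very ends of the read, and the function names and the primer strings in the usage note are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous, upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_primers(read, forward_primer, reverse_primer):
    """Strip the forward primer from the 5' end and the reverse primer site from the 3' end.

    Assumes exact matches; degenerate bases and sequencing errors are not handled.
    """
    read = read.upper()
    fwd = forward_primer.upper()
    # The reverse primer anneals to the opposite strand, so its site on the
    # read appears as the reverse complement at the 3' end.
    rev_site = reverse_complement(reverse_primer.upper())
    if fwd and read.startswith(fwd):
        read = read[len(fwd):]
    if rev_site and read.endswith(rev_site):
        read = read[:len(read) - len(rev_site)]
    return read
```

For example, with the made-up primers `ACGTAC` (forward) and `AAGCTC` (reverse), the read `ACGTACTTTTGGGGGAGCTT` is reduced to the internal region `TTTTGGGG`.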

Software design and implementation
MMDBD was implemented on the relational database system MySQL (version 5.0.45). NCBI BLAST (Basic Local Alignment Search Tool, v2.2.17) [47] was used as the similarity search engine. The Perl DBI module was employed to connect to MySQL, submit queries and obtain results. MMDBD also uses the AJAX (asynchronous JavaScript and XML) technique, which provides a rich and smooth user interface. Requests are posted to server-side PHP scripts and the responses are converted into JSON (JavaScript Object Notation) format for data interchange.
MMDBD consists of three tables, which store the general information, DNA sequences and references of the chosen medicinal materials (Table 2). Table "Medicinal material information" stores taxonomic, medicinal and morphological information. Its field "CITESApp" records the status of endangered medicinal species in CITES, and another field, "ParmacopeiaInfor", identifies all pharmacopoeia-listed species in MMDBD. Table "Barcode sequence" is a sequence factory, which contains the DNA sequences. Foreign keys "MMId" and "RefId" link it to tables "Medicinal material information" and "Reference", respectively. Some of the DNA sequences have been published, and table "Reference" stores information on the corresponding articles, including author names, journal issue numbers, PubMed IDs and abstracts. This allows users to trace the work that generated the DNA barcodes.
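
The three-table layout can be sketched as below. The sketch uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained. Only the names "MMId", "RefId", "CITESApp" and "ParmacopeiaInfor" come from the description above; the table identifiers, all other columns and both function names are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE MedicinalMaterial (
    MMId             INTEGER PRIMARY KEY,
    ScientificName   TEXT NOT NULL,   -- assumed column
    CITESApp         TEXT,            -- CITES status of the species
    ParmacopeiaInfor TEXT             -- marks pharmacopoeia-listed species
);
CREATE TABLE "Reference" (
    RefId    INTEGER PRIMARY KEY,
    Authors  TEXT,                    -- assumed column
    PubMedId TEXT                     -- assumed column
);
CREATE TABLE BarcodeSequence (
    SeqId    INTEGER PRIMARY KEY,     -- assumed column
    MMId     INTEGER NOT NULL REFERENCES MedicinalMaterial(MMId),
    RefId    INTEGER REFERENCES "Reference"(RefId),
    Region   TEXT,                    -- assumed column, e.g. 'ITS'
    Sequence TEXT NOT NULL            -- assumed column
);
"""


def build_demo_db():
    """Create an in-memory database with the sketched three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript(SCHEMA)
    return conn


def barcodes_for_species(conn, scientific_name):
    """Follow both foreign keys to list barcodes and any linked PubMed ID."""
    query = """
        SELECT b.Region, b.Sequence, r.PubMedId
        FROM BarcodeSequence AS b
        JOIN MedicinalMaterial AS m ON m.MMId = b.MMId
        LEFT JOIN "Reference" AS r ON r.RefId = b.RefId
        WHERE m.ScientificName = ?
    """
    return conn.execute(query, (scientific_name,)).fetchall()
```

The LEFT JOIN on "Reference" reflects the statement that only some of the sequences have been published, so a barcode row may have no reference.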

Database access and tools
MMDBD provides a simple web-based interface to retrieve barcode information (Fig. 1). The URL is http://www.cuhk.edu.hk/icm/mmdbd.htm. MMDBD has three major functions: database query through a text-based interface, sequence similarity search, and data submission. Researchers can enrich the database by contributing their DNA barcode sequences to MMDBD. For data submission, we provide a template file in Excel format, which enables users to submit batches of sequences and information via email. The administrator then checks the data quality before incorporating the data into the database.

Database search interface

Search view
There are two search methods: keyword search and Chinese character stroke number search (Fig. 2). In the former, users can enter a single word or phrase to search the herb names, species names, family names and references of the medicinal materials. The keywords are passed to the server in UTF-8 encoding, which supports both simplified and traditional Chinese characters. The other method searches by the stroke number of the Chinese species name. A stroke number table was created according to the stroke number of the first Chinese character of each species name. Users can obtain the desired information on medicinal materials by following the hyperlinks in the table.
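
The stroke number table can be thought of as an index keyed on the stroke count of the first character. The following Python sketch shows the idea under stated assumptions: the function name is hypothetical, and the small lookup dictionary is a three-entry illustration, whereas a real implementation would need a complete stroke-count resource.

```python
from collections import defaultdict

# Tiny illustrative lookup of stroke counts (not a complete table).
STROKES = {"人": 2, "三": 3, "甘": 5}


def build_stroke_index(names, strokes=STROKES):
    """Group Chinese names by the stroke count of their first character."""
    index = defaultdict(list)
    for name in names:
        if not name:
            continue
        count = strokes.get(name[0])
        if count is not None:  # characters missing from the lookup are skipped
            index[count].append(name)
    # Return a plain dict ordered by stroke count, as a browsable table would be.
    return dict(sorted(index.items()))
```

Calling `build_stroke_index(["甘草", "人参", "三七"])` groups the three names under 2, 3 and 5 strokes; each entry would then be rendered as a hyperlink to the corresponding record.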

Sequence similarity search
MMDBD uses the BLASTN algorithm for sequence similarity search. This function allows users to conduct homology searches between their sequences of interest and the data in the database (Fig. 3). The input sequence should be in raw sequence format and its length must be between 10 and 2000 bases. The word size in BLASTN is preset to 7 in order to improve search sensitivity for short sequences. Users can also adjust the E-value to further optimize the search results. The BLAST output page displays the sequence homology in a rich interface in which users can reach the details page of the target barcode sequence by clicking the color bar links.
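
The input constraints and preset word size can be sketched as follows. This is an illustration, not the server code: the function names, the decision to accept IUPAC ambiguity codes and to ignore whitespace, and the default E-value of 10 are assumptions. The argument list targets the legacy `blastall` program, whose `-p`, `-d`, `-i`, `-W` and `-e` options select the program, database, query file, word size and expectation value.

```python
import re

MIN_LEN, MAX_LEN = 10, 2000   # length limits stated for the query sequence
WORD_SIZE = 7                 # preset BLASTN word size

# Assumption: raw format means bases only (IUPAC nucleotide codes), no FASTA header.
_BASES = re.compile(r"^[ACGTURYKMSWBDHVN]+$")


def clean_query(raw):
    """Validate a raw nucleotide query and return it in upper case without whitespace."""
    seq = re.sub(r"\s+", "", raw).upper()
    if not _BASES.match(seq):
        raise ValueError("query must contain nucleotide codes only")
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        raise ValueError(f"query length must be between {MIN_LEN} and {MAX_LEN} bases")
    return seq


def blastall_args(query_path, db_path, evalue=10.0):
    """Build (but do not run) a legacy blastall command line for BLASTN."""
    return [
        "blastall", "-p", "blastn",
        "-d", db_path,
        "-i", query_path,
        "-W", str(WORD_SIZE),
        "-e", str(evalue),
    ]
```

The returned list could be passed to `subprocess.run` on a machine where legacy BLAST and a formatted nucleotide database are installed; both paths are supplied by the caller.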

Future development
The database has already covered 66.5% and 84.5% of the medicinal materials listed in the Chinese Pharmacopoeia

Conclusions
MMDBD was initiated to support the DNA barcoding initiative and covers numerous important medicinal materials. It contains DNA barcode sequences, basic information and references for these materials. The integrated database provides users with easy access to and retrieval of the data, together with web tools for sequence comparison. MMDBD will play a timely and important role in the authentication and quality control of medicinal materials and will benefit the herbal industry. It will also be useful to investigators in conservation, forensic and systematic analysis.

Availability and requirements
The MMDBD is publicly available and can be accessed at http://www.cuhk.edu.hk/icm/mmdbd.htm.