De novo assembly and Characterisation of the Transcriptome during seed development, and generation of genic-SSR markers in Peanut (Arachis hypogaea L.)
© Zhang et al; licensee BioMed Central Ltd. 2012
Received: 30 November 2011
Accepted: 12 March 2012
Published: 12 March 2012
The peanut (Arachis hypogaea L.) is an important oilseed crop in tropical and subtropical regions of the world. However, little about the molecular biology of the peanut is currently known. Recently, next-generation sequencing technology, termed RNA-seq, has provided a powerful approach for analysing the transcriptome, and for shedding light on the molecular biology of peanut.
In this study, we employed RNA-seq to analyse the transcriptomes of the immature seeds of three different peanut varieties with different oil contents. A total of 26.1-27.2 million paired-end reads with lengths of 100 bp were generated from the three varieties and 59,077 unigenes were assembled with N50 of 823 bp. Based on sequence similarity search with known proteins, a total of 40,100 genes were identified. Among these unigenes, only 8,252 unigenes were annotated with 42 gene ontology (GO) functional categories. And 18,028 unigenes mapped to 125 pathways by searching against the Kyoto Encyclopedia of Genes and Genomes pathway database (KEGG). In addition, 3,919 microsatellite markers were developed in the unigene library, and 160 PCR primers of SSR loci were used for validation of the amplification and the polymorphism.
We completed a successful global analysis of the peanut transcriptome using RNA-seq, a large number of unigenes were assembled, and almost four thousand SSR primers were developed. These data will facilitate gene discovery and functional genomic studies of the peanut plant. In addition, this study provides insight into the complex transcriptome of the peanut and established a biotechnological platform for future research.
The peanut (Arachis hypogaea L.), also known as the groundnut, is an important oilseed crop in the tropical and subtropical regions of the world. It is grown on six continents but mainly in Asia, Africa and America. Peanuts are cultivated on 23.51 million hectares worldwide, with a total global production of approximately 35.52 million tons (the weight includes the shell). China is the largest producer in the world, accounting for 37.6% (13.34 million tons) of the total world production (FAO, 2009, http://faostat.fao.org).
Peanuts have a desirable fatty acid profile and are rich in vitamins, minerals and bioactive materials, including several known heart-healthy nutrients, such as monounsaturated and polyunsaturated fatty acids, potassium, magnesium, copper niacin, arginine, fibre, α-tocopherol, folates, phytosterols, and flavonoids. Indeed, peanut consumption has been associated with an improvement in the overall quality of the diet and nutrient [1–4].
In China, almost 60% of the peanuts are used to produce peanut oil . Peanut oil, due to its high monounsaturated fat content, is considered healthier than saturated oils and is resistant to rancidity. Monounsaturated fat, much of which is oleic acid, is a healthy type of fat that has been implicated in the health of skin  and has been demonstrated to reduce cardiovascular disease risk and/or risk factors in both epidemiological and clinical studies [1, 2, 7].
The development of the peanut seed has been studied intensely to understand the physiological, biochemical, and molecular characteristics that determine the oil quality and their beneficial nutritional contributions. However, the development of the peanut seed is a complex process involving a cascade of biochemical changes, which involve the transcriptional modulation of many genes, yet little is known about these transcriptional changes and their regulation. To date, little research in this area has been reported. Bi  developed a seed cDNA library for the peanut to analyse gene expression levels during seed development, and 17,000 expressed sequence tags (ESTs) were sequenced and used for microarray analysis. Recently, the development of next-generation high-throughput DNA sequencing technology has provided a novel method for both mapping and quantifying transcriptomes (RNA-seq) . RNA-seq technology has been successfully applied quite ubiquitously to species such as humans, yeast, mice, grape, Arabidopsis, rice, soybeans, sesame, and sweetpotato [9–19]. Moreover, RNA-seq data are highly reproducible, with few systematic discrepancies among technical replicates . The latest paired-end tag sequencing strategy of RNA-seq further improves the DNA sequencing efficiency and expands short-read lengths, providing a better depiction of transcriptomes . Transcriptomic information is used in a wide range of biological studies and provides fundamental insight into biological processes and applications, such as the levels of gene expression , the gene expression profiles during development [13, 17] or after experimental treatments , gene discovery , SSR mining [10, 11, 25], and SNP discovery [12, 25–27]. However, transcriptomic information is lacking for the peanut plant because this information is difficult to obtain and, to date, there has been little interest in such data.
Chen  reported that the accumulation of seed oil in peanuts could be divided into three stages based on phenotype, namely, the initial accumulation stage, the fast accumulation stage and the steady accumulation stage. As we are interested in identifying genes that are expressed in the seed during the fast accumulation period, we carried out a global analysis of the peanut transcriptome during seed development using the Illumina RNA-seq method. We also present an overview of the RNA-seq data for the peanut as a potential model for future RNA-seq analyses and to establish a biotechnological platform for peanut research.
Sample Preparation and Sequencing
Three peanut varieties, which greatly differ in oil content but are similar in mature process, were used in this study. Jihua 4 (JH4), a variety with a high oil content (57%), as determined in a 2009 field experiment in Hebei, China, was bred at the Hebei Institute of Agricultural Sciences and is widely grown in Northern China. Kaixuan01-6 (K01), a germplasm resource with a similar oil content (52%), also determined in the above field experiment in Hebei, was provided by the Yantai Institute of Agricultural Sciences in the Shandong Province of China. Te21 (T21), a germplasm resource with a low oil content (48%), as determined in China, was also used.
For each variety, RNA was isolated from ten immature seeds of five plants, which were harvested at 7 weeks after flowering during the 2010 growing season from a farm in Hebei province (Shijiazhuang, Hebei, China). Total RNA was extracted using a previously described method . The RNA quality and quantity were determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Beads coated with oligo(dT) were used to isolate poly (A) mRNA after the total RNA was collected. Fragmentation buffer (Ambion, Austin, TX) was added to digest the mRNA to produce short fragments. The first strand of cDNA was synthesised using random hexamer primers, followed by synthesis of the second strand. The short fragments were purified with the QIAquick PCR Purification kit (Qiagen, Valencia, CA) for both end repair and the poly (A) addition reaction. The purified DNA libraries were amplified by PCR for 18 cycles. Finally, Solexa HiSeq™ 2000 was employed to sequence the libraries using PCR amplification (BGI, Shenzhen, China).
De novo Assembly and Analysis of Illumina Reads
The samples were assembled with SOAPdenovo  separately. The numbers of paired-end Illumina reads of JH4, K01, and T21 were 27,159,362, 26,938,530, and 26,142,148, respectively. The reads were first combined to form longer fragments, i.e., contigs. The reads were then mapped back to the contigs, and the paired-end reads and contigs from the same transcript were assembled to form a longer sequence, with N for unknown sequences (i.e., scaffolds). Paired-end reads were again used for gap filling of the scaffolds to obtain unigenes with the least Ns that could not be extended on either end. For future analyses, the unigenes from the three samples were assembled again to acquire non-redundant unigenes (All-Unigenes) that were as long as possible. The All-Unigenes assembled from the three samples were compared with the NCBI non-redundant (NR) protein database using blastx v2.2.14  with an E-value cut-off of 1e-5. Based on the results of the protein database annotation, Blast2GO  was employed to obtain the functional classification of the unigenes based on GO terms. WEGO software  was used to perform the GO functional classification for all of the unigenes and to understand the distribution of the gene functions of this species at the macro level. The KEGG database (V56.0, Oct. 1, 2010)  was used to annotate the pathway of these unigenes.
SSR mining and primer design
We employed MIcroSAtellite (MISA) http://pgrc.ipk-gatersleben.de/misa/ for microsatellite mining. In this study, the SSRs were considered to contain motifs with two to six nucleotides in size and a minimum of 5 contiguous repeat units. Based on MISA results, Primer3 v2.23 (http://primer3.sourceforge.net) was used to design the primer pairs with default setting, and the PCR product size ranging from 100 to 280 bp. Six varieties were selected to validated the polymorphism of 160 random SSR markers. The six varieties included three varieties (Jihua 2, Jihua 5, and SW9721), which were all bred at the Hebei Institute of Agricultural Sciences, and the three varieties (JH4, K01, T21) used in this study. The SSR data on the six varieties were obtained following the methods described by Liang .
Results and Discussion
Sequencing and De novo Assembly of Solexa Short Reads
Summary of the short reads and the assemblies for three varieties.
Total Nucleotides (nt)
Contribute to All-Unigenes
Assembling these reads produced 44,028, 47,110 and 44,157 unigenes for the JH4, K01 and T21 varieties, respectively; only unigenes greater than 200 bp in length were further analysed. The N50 values of these three unigenes were 664 bp, 616 bp and 616 bp, respectively. After the final clustering, 59,077 unigenes were obtained, with approximately equal contribution from varieties JH4 (74.53%), K01 (79.74%), and T21 (74.74%) (The peanut transcriptome sequences is available at National Centre for Biotechnology Information (NCBI) Transcriptome Shotgun Assembly (TSA) database, http://www.ncbi.nlm.nih.gov/, accession numbers are JR540742-JR590649). The length of the unigenes varied from 200 bp to 10654 bp, with an average of 619 bp, and the N50 value was 823 bp. The majority of the reads were in the range of 201-500 bp (64.80% of the unigenes), and 46,176 unigenes (78.16% of all of the unigenes) were shorter than 1 k. These results should provide a sequence basis for future studies, such as gene cloning and transgenic studies.
Characterisation of the unigenes
Among the 59,077 unigenes, the sequence directions of 42,997 of the unigenes were determined using blastx against the NCBI non-redundant (NR), Swiss-Prot, KEGG, and clusters of orthologous groups (COG) of protein databases with an E-value cut-off of 1e-5. In addition, the protein coding regions of 39,204 unigenes were predicted.
The unigenes were compared against the NCBI NR protein database using blastx. Among the 59,077 unigenes, 40,100 (67.88%) had at least one significant match with an E-value below 1e-5. A total of 18,977 unigenes had no significant matches to any known protein, the result that may be partly due to novel genes or highly divergent genes, or these unigenes could represent untranslated regions. For validating redundancy of the data set with publicly available data, sequence similarity search was conducted against the NCBI Unigene database ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene/Arachis_hypogea/ using blastn with an E-value cut-off of 1e-10. The results indicated that out of 59,077 unigenes, 34,815 (58.93%) showed significant similarity to publicly unigenes database. All the information on the redundancy of the data set with the publicly available data were showed in the supplemental file (Additional file 1: Table S1).
Functional classification of the peanut unigenes
A total of 18,028 unigenes were annotated with 125 pathways in the KEGG database (V56.0, Oct. 1, 2010); metabolic pathways were the most enriched (3,899), followed by plant-pathogen interaction pathways (1,290). Some pathways, such as the fatty acid metabolism pathway and fatty acid biosynthesis, the functions of which are clearly linked to the changes in the seed oil that take place during peanut ripening, would characterise in more detail in another paper (Zhang et al., unpublished).
SSR mining from the peanut seed transcriptome
In this study, we performed a global characterisation of the peanut transcriptome by RNA-seq using next-generation Illumina sequencing. We generated 26.1-27.2 million paired-end reads, comprising 59,077 unigenes from three different varieties of peanut with different oil contents. These unigenes were annotated with 42 GO functional categories and 125 pathways. A total of 5,883 microsatellites were identified among the 59,077 unigenes, and 3,919 primer pairs were developed based on the sequence library. These data will facilitate gene discovery and functional genomic studies in peanuts. We gained insight into the complex transcriptome of the peanut and established a biotechnological platform for future research.
This work was supported by the earmarked fund for China Agriculture Research System (CARS-14), the Research Program on Quality-Improvement and Profit- Incensement for Major Economic Crops in Hebei Province (2011055005) and the Key Basic Research Program of Hebei Province (10960122D).
- Stephens AM, Dean LL, Davis JP, Osborne JA, Sanders TH: Peanuts, Peanut Oil, and Fat Free Peanut Flour Reduced Cardiovascular Disease Risk Factors and the Development of Atherosclerosis in Syrian Golden Hamsters. J Food Sci. 2010, 75: H116-H122. 10.1111/j.1750-3841.2010.01569.x.View ArticlePubMed
- Kris-Etherton PM, Pearson TA, Wan Y, Hargrove RL, Moriarty K, Fishell V, Etherton TD: High-monounsaturated fatty acid diets lower both plasma cholesterol and triacylglycerol concentrations. Am J Clin Nutr. 1999, 70: 1009-1015.PubMed
- Kerckhoffs DA, Brouns F, Hornstra G, Mensink RP: Effects on the human serum lipoprotein profile of beta-glucan, soy protein and isoflavones, plant sterols and stanols, garlic and tocotrienols. J Nutr. 2002, 132: 2494-2505.PubMed
- Griel AE, Eissenstat B, Juturu V, Hsieh G, Kris-Etherton PM: Improved diet quality with peanut consumption. J Am Coll Nutr. 2004, 23: 660-668.View ArticlePubMed
- Ge SR, Sui QW, Yu SL: Simply analysis of the situation and countermeasures of peanut production in China. J Peanut Sci. 1993, 4: 3-6.
- Ozcan MM: Some nutritional characteristics of kernel and oil of peanut (Arachis hypogaea L.). J Oleo Sci. 2010, 59: 1-5. 10.5650/jos.59.1.View ArticlePubMed
- Fraser GE, Sabate J, Beeson WL, Strahan TM: A possible protective effect of nut consumption on risk of coronary heart disease. The Adventist Health Study. Arch Intern Med. 1992, 152: 1416-1424. 10.1001/archinte.1992.00400190054010.View ArticlePubMed
- Bi YP, Liu W, Xia H, Su L, Zhao CZ, Wan SB, Wang XJ: EST sequencing and gene expression profiling of cultivated peanut (Arachis hypogaea L.). Genome. 2010, 53: 832-839. 10.1139/G10-074.View ArticlePubMed
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.View ArticlePubMed
- Wang Z, Fang B, Chen J, Zhang X, Luo Z, Huang L, Chen X, Li Y: De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas). BMC Genomics. 2010, 11: 726-10.1186/1471-2164-11-726.PubMed CentralView ArticlePubMed
- Wei W, Qi X, Wang L, Zhang Y, Hua W, Li D, Lv H, Zhang X: Characterization of the sesame (Sesamum indicum L.) global transcriptome using Illumina paired-end sequencing and development of EST-SSR markers. BMC Genomics. 2011, 12: 451-10.1186/1471-2164-12-451.PubMed CentralView ArticlePubMed
- Lu T, Lu G, Fan D, Zhu C, Li W, Zhao Q, Feng Q, Zhao Y, Guo Y, Huang X, et al: Function annotation of the rice transcriptome at single-nucleotide resolution by RNA-seq. Genome Res. 2010, 20: 1238-1249. 10.1101/gr.106120.110.PubMed CentralView ArticlePubMed
- Severin AJ, Woody JL, Bolon Y-T, Joseph B, Diers BW, Farmer AD, Muehlbauer GJ, Nelson RT, Grant D, Specht JE, et al: RNA-Seq Atlas of Glycine max: A guide to the soybean transcriptome. BMC Plant Biology. 2010, 10: 160-10.1186/1471-2229-10-160.PubMed CentralView ArticlePubMed
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 1344-1349. 10.1126/science.1158441.PubMed CentralView ArticlePubMed
- Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008, 453: 1239-1243. 10.1038/nature07002.View ArticlePubMed
- Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, et al: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619. 10.1038/nmeth.1223.View ArticlePubMed
- Zenoni S, Ferrarini A, Giacomelli E, Xumerle L, Fasoli M, Malerba G, Bellin D, Pezzotti M, Delledonne M: Characterization of Transcriptional Complexity during Berry Development in Vitis vinifera Using RNA-Seq. Plant Physiology. 2010, 152: 1787-1795. 10.1104/pp.109.149716.PubMed CentralView ArticlePubMed
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.View ArticlePubMed
- Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, Wong WK, Mockler TC: Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010, 20: 45-58. 10.1101/gr.093302.109.PubMed CentralView ArticlePubMed
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMed CentralView ArticlePubMed
- Fullwood MJ, Wei CL, Liu ET, Ruan Y: Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009, 19: 521-532. 10.1101/gr.074906.107.PubMed CentralView ArticlePubMed
- Torres TT, Metta M, Ottenwalder B, Schlotterer C: Gene expression profiling by massively parallel sequencing. Genome Res. 2008, 18: 172-177.PubMed CentralView ArticlePubMed
- Hegedus Z, Zakrzewska A, Agoston VC, Ordas A, Racz P, Mink M, Spaink HP, Meijer AH: Deep sequencing of the zebrafish transcriptome response to mycobacterium infection. Mol Immunol. 2009, 46: 2918-2930. 10.1016/j.molimm.2009.07.002.View ArticlePubMed
- Clark MS, Thorne MA, Vieira FA, Cardoso JC, Power DM, Peck LS: Insights into shell deposition in the Antarctic bivalve Laternula elliptica: gene discovery in the mantle transcriptome using 454 pyrosequencing. BMC Genomics. 2010, 11: 362-10.1186/1471-2164-11-362.PubMed CentralView ArticlePubMed
- Iorizzo M, Senalik DA, Grzebelus D, Bowman M, Cavagnaro PF, Matvienko M, Ashrafi H, Van Deynze A, Simon PW: De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity. BMC Genomics. 2011, 12: 389-10.1186/1471-2164-12-389.PubMed CentralView ArticlePubMed
- Graham IA, Besser K, Blumer S, Branigan CA, Czechowski T, Elias L, Guterman I, Harvey D, Isaac PG, Khan AM, et al: The genetic map of Artemisia annua L. identifies loci affecting yield of the antimalarial drug artemisinin. Science. 2010, 327: 328-331. 10.1126/science.1182612.View ArticlePubMed
- Oliver RE, Lazo GR, Lutz JD, Rubenfield MJ, Tinker NA, Anderson JM, Wisniewski Morehead NH, Adhikary D, Jellen EN, Maughan PJ, et al: Model SNP development for complex genomes based on hexaploid oat using high-throughput 454 sequencing technology. BMC Genomics. 2011, 12: 77-10.1186/1471-2164-12-77.PubMed CentralView ArticlePubMed
- Chen SL, Li YR, Xu GZ, Cheng ZS: Simulation on Oil Accumulation Characteristics in Different High-Oil Peanut Varieties. Acta Agronomica Sinica. 2008, 34: 142-149. 10.3724/SP.J.1006.2008.00142.View Article
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMed
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMed
- Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21: 3674-3676. 10.1093/bioinformatics/bti610.View ArticlePubMed
- Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L: WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006, 34: W293-W297. 10.1093/nar/gkl031.PubMed CentralView ArticlePubMed
- Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, 36: D480-D484.PubMed CentralView ArticlePubMed
- Liang X, Chen X, Hong Y, Liu H, Zhou G, Li S, Guo B: Utility of EST-derived SSR in cultivated peanut (Arachis hypogaea L.) and Arachis wild species. BMC Plant Biology. 2009, 9: 35-10.1186/1471-2229-9-35.PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.