A transcriptional landscape of 28 porcine tissues obtained by super deepSAGE sequencing

Background Gene expression regulators identified in transcriptome profiling experiments may serve as ideal targets for genetic manipulations in farm animals. Results In this study, we developed a gene expression profile of 76,000+ unique transcripts for 224 porcine samples from 28 tissues collected from 32 animals using Super deepSAGE technology. Excellent sequencing depth was achieved for each multiplexed library, and replicated samples from the same tissues clustered together, demonstrating the high quality of Super deepSAGE data. Comparison with previous research indicated that our results not only have good reproducibility but also have greatly extended the coverage of the sample types as well as the number of genes. Clustering analysis revealed ten groups of genes showing distinct expression patterns among these samples. Our analysis of over-represented binding motifs identified 41 regulators, and we demonstrated a potential application of this dataset in infectious diseases and immune biology research by identifying an LPS-dependent transcription factor, runt-related transcription factor 1 (RUNX1), in peripheral blood mononuclear cells (PBMCs). The selected genes are specifically responsible for the transcription of toll-like receptor 2 (TLR2), lymphocyte-specific protein tyrosine kinase (LCK), and vav1 oncogene (VAV1), which belong to the T and B cell signaling pathways. Conclusions The Super deepSAGE technology and tissue-differential expression profiles are valuable resources for investigating the porcine gene expression regulation. The identified RUNX1 target genes belong to the T and B cell signaling pathways, making them novel potential targets for the diagnosis and therapy of bacterial infections and other immune disorders.


Background
The domestic pig (Sus scrofa) is an important animal farmed for meat worldwide and has been used as an alternative model for studying genetics, nutrition, and disease [1][2][3]. The swine research community has created a large database of the pig transcriptome [4]. The recently released pig genome sequence (S. scrofa 10.2) [5] and associated annotation greatly enhance our knowledge of pig biology [6,7]. Currently, it is estimated that the porcine genome encodes for ∼20,000 genes [5]. Transcriptome analysis indicates that, of the total, actively transcribed genes represent only a mere fraction of 15,000 genes in all tissues [8]. Several research groups have created microarray transcriptome profiling data for humans [9,10], mouse [11,12], and rat tissues [13]. In the pig, several Expressed Sequence Tag (EST) sequencing projects, microarray platforms, long-SAGE, and deep sequencing projects have developed gene expression profiles across a range of tissues [8,14,15]. In comparison to other model organisms, the pig transcriptome data has its limitations in terms of coverage of tissues and genes [4]. Here, we present Super deepSAGE (serial analysis of gene expression by deep sequencing) profiling data for pig tissues with wide gene coverage and annotation. Using the K-means clustering analysis and motif binding site enrichment analysis, we have identified key regulators for co-expressed genes. A detailed analysis of one such identified transcription factor, RUNX1, illustrates the impact of the data.

Results and discussion
Analysis of the complexity and diversity of super deepSAGE data across tissues Super deepSAGE obtained~5 million reads per sample with an average sequencing depth of 71X (total number of genes identified by deep sequencing / total number of aligned reads, sequencing matrix is listed in Supplemental document 1). A total of 32,213 transcripts were covered by Super deepSAGE. Rarefaction analysis of a sizefractionated library for each tissue was performed to determine the complexity and diversity of pig tissues [16]. The sequencing depth achieved using eight samplesmultiplexed deep sequencing technique (added different linker and pooled eight samples together to a single deep sequencing run) reached near-saturation of transcript discovery within all size ranges. Saturation was seen very early in Super deepSAGE sequencing data due to low tag complexity (number of tags) in libraries ( Fig. 1a-f showed the first six deep sequencing runs). Samples from the same sequencing run were compared using reads from different size-fractionated libraries to further investigate the diversity of the relationship between sequencing depth and transcript discovery. In all deep sequencing runs, tissues exhibited transcriptome diversity Fig. 1 Rarefaction analysis of covered genes/transcripts in porcine tissues and cells Super deepSAGE library. Plot a to f shows the covered Kilo transcripts per Kilo reads in the first six Super deepSAGE sequencing runs. The samples in each sequencing run were randomized and detailed information is given in Table 1 in terms of both the total number of reads and the number of transcripts discovered. For example, the muscle tissue (MS.DI_2), saturated much sooner than the conceptus (CPT.SPH_8) and fewer transcripts were discovered in the first deep sequencing run (Fig. 1a). Similar sequencing depth and diversity were obtained using size-fractionated reads numbers from the other 22 sequencing run and discovered transcript numbers as outcome measures (Supplemental Fig. SA-D).
Data quality and internal consistency control using principal component analysis (PCA) Principal component analysis (PCA) was used to check if the samples clustered together according to their tissue source [17]. Even though the samples were collected from 32 individual animals from different families, genders, and ages (Table 1), the PCA plot confirmed that the samples from the same tissues clustered together and were distinct from other samples (Fig. 2). The transcripts in conceptus, blood, and macrophages had relatively distinct expression profiles and segregation from the rest of the samples when plotted using the first two components of the PCA analysis (Fig. 2a). The adenohypophysis, cerebral cortex, heart, and muscle were aggregate and separated from other samples when plotted using the third and fourth components (Fig. 2b). The adrenal, liver, mesenteric lymph nodes, peripheral blood mononuclear cell, and spleen deviated from other samples when plotted using the fifth and sixth components (Fig. 2c). When eliminating those samples from the datasets and re-calculating the PCAs, the remaining samples; fat, placenta, endometrium, kidney, lung, and stomach grouped differently according to the tissue/cell types ( Fig. 2d-f). Tissues having similar cellular composition and biological function, like alveolar and monocytederived macrophages or heart and skeletal muscles, clustered closely together but were distinct from each other.

Comparison of the super deepSAGE data with previously published microarray research
The expression profiles were compared with previously published microarray data [8]. The processed microarray datasets were acquired from the GEO database and normalized to the Super deepSAGE data using the quantile normalization method to make these two datasets comparable. There is a total of 8199 common transcripts for seven tissues in both platforms, a total of 24,013 transcripts remain undetected by the Affymetrix platform, and a total of 4478 transcripts were undetected in Super deepSAGE experiments (Fig. 3). Among the commonly detected transcripts, a high correlation (r = 0.85-0.93 and p-values less than 1.0 × e − 30 ) was calculated between the gene expression profiles generated by the two platforms (Fig. 3). A similar dynamic range was observed in both platforms for transcripts with a relative expression level (log2 based and quantile normalized expression value) between 4.0 and 9.0. Differences in expression profiles were apparent between the two platforms as several genes exhibiting relatively higher or lower expression values in either platform deviated from the slope (Fig. 3). All transcripts had an expression value in the microarray due to background hybridization or noise, regardless of whether it was truly expressed or not. The overall dynamics of the fitted curve tend to show that the Super deepSAGE platform is a more sensitive technique than the microarray for low expression genes that show a concaved trend at the lower ends (with relative expression level less than 4.0 in Fig. 3). For those genes with high expression levels, variability is high in both Super deepSAGE and microarray platforms. In the seven overlapped tissues between Super deepSAGE and microarray, the 50 highest expressed Super deepSAGE tags, 38 (76%) found corresponding probe sets in the 50 highly expressed genes, and only three tags showed a Identification of tissue-differential expression of transcripts A total of 4165 transcripts showed significant up or down-regulation in at least one tissue, in comparison to the average tag count for 27 tissues. K-means clustering analysis was then performed by trying a different number of centers (K from 5 to 28) and several random sets (S from 10 to 1000). An ad hoc method comparing each tissue to the average tag count for all 27 tissues was performed, and a very stringent threshold was set (fold change > 5.0, p-value < 1.0 × 10 − 6 ) to filter the tissues specifically expressing transcripts. We selected K = 10 and S = 400 to produce a clustered result with a clear expression pattern (by visualization), highly reproducible for each duplicated run (Fig. 4). The detailed clustering information is available in Supplemental document 2. The result indicated that Cluster 1 has the largest number of transcripts, and most of these transcripts were expressed at a low level in tissues, except macrophages, PBMCs, blood, and conceptus in which it was moderately expressed. The conceptus expressed transcripts were in Cluster 2, while the conceptus, macrophages, PBMCs, and blood down-expressed transcripts were in Cluster 4. The macrophages, PBMCs, blood, mesenteric lymph nodes, and spleen specific transcripts were in Cluster 5. The genes specifically expressed in the heart and skeletal muscles were in cluster 10. The cerebral cortex specific genes were in Cluster 6, and liver specifically expressed transcripts were in Cluster 7. The adrenal cortex, adrenal medulla, cerebral cortex, and adenohypophysis specific transcripts were in Cluster 8. Transcripts in Cluster 3 and Cluster 9 were ubiquitously expressed in multiple tissues.

Identification of over-represented motif for tissues specifically expressed transcripts
The CLOVER software [18] with JASPAR PWM database [19] was used to identify over-represented transcription factor binding motifs for each gene cluster. The promoter regions for a transcript cluster (1000 bp upstream from the TSS) were determined using the Ensemble Biomart tool (Sus scrofa assembly 11.1, gene 99) [20]. The promoter regions for the transcripts detected, with a similar GC content, were used as background.
Motifs having a p-value of ≤0.05 was significant ( Table 2, top 5 motifs). The most significantly enriched motif in Class 1 is MZF1. TFAP2A and TFAP2C were also significantly enriched with a raw score higher than 30. In Class 2, there was only one significantly enriched motif, RHOXF1. In Class 3 and 4, there were five and four motifs with p-value < 0.05 respectively, but the raw score was lower than ten. In Class 5, there were at least five motifs with p-value < 0.05, and three of them, RUNX1, ASCL1, and Myod1 had a raw score higher than 30. In Class 6, the significantly enriched motifs with the highest score were SNAI2 and FIGLA, whereas, in Class 7, the significantly enriched motifs with the highest score was NR4A2. In Class 8, there was only one motif ZEB1 enriched in the promoter region of these transcripts. In Class 9, all the enriched motifs had a raw score of less than ten. In Class 10, the top three motifs were Ascl2, Myog, and Tcf12.
Case report: confirmation of the regulatory roles of RUNX1 in PBMCs in pig In the cluster heatmap ( Fig. 4), Class 1 and 5 tentatively show (by visualization) high expression in macrophages, PBMCs, and blood. However, the expression level of genes in Class 1 was lower than in Class 5. Further, mesenteric lymph nodes and spleen specific transcripts in Cluster 5 indicated that this class is an immunity-related gene cluster. The top over-represented motif in Class 5 is RUNX1, and literature search of its targets indicated that TLR-2 (Toll-like receptor), LCK (tyrosine kinases), and VAV1 (Rho family GTPases) play a role in T and B-cell development and activation. These three representative RUNX1 targets were selected for further experimental validation.
Confirmation of the RUNX1 binding site in the promoter region of TLR-2, LCK, and VAV1 The toll-like receptor 2 (TLR-2), lymphocyte-specific protein tyrosine kinase (LCK), and vav1 oncogene (VAV1) plasmid containing the 1Kb putative promoter sequence were used in in vivo studies (wild type). To show the regulatory effect of RUNX1, the binding site of RUNX1 in TLR-2, LCK, and VAV1 was mutated or deleted. Reporter vectors constructed by the wild type, mutated, or deleted promoter sequences were transfected into the peripheral blood mononuclear cells (PBMCs), and luciferase activity was monitored. Binding site deletion significantly attenuated the expression of the downstream reporter luciferase activity (p < 0.05), indicating that RUNX1 could interact with the target site and regulate the expression of the downstream reporter gene (Fig. 5a- Fig. 5d-f). The luciferase reporter activity after transfection with the wild-type vector was significantly higher in macrophage cells than in the PBMC assays, suggesting that the endogenous RUNX1 expression in mouse macrophage cells was higher than in PBMCs.

RNA flow cytometry analysis of RUNX1 targets in LPS and RUNX1 inhibitor-treated PBMCs
To show the effect of RUNX1 on three targets; TLR2, LCK, and VAV1, pig PBMCs were stimulated with LPS and/or RUNX1 inhibitor, for 6 h, during which their TLR2, LCK, VAV1, CD14 protein levels were monitored. Two subsets of cells readily emerged from CD14/TLR2 analysis in PBMCs: a CD14 hi /TLR2 lo (CD14 high /TLR2 low ) and a CD14 lo /TLR2 lo population (Fig. 6d)   Samples were collected six hours post-stimulation. A total of 21 genes were induced significantly in response to at least one dose of LPS stimulation, as expression levels for these genes were different when compared to non-stimulated control. A total of 10 genes were downregulated significantly in response to the RUNX1 inhibitor treatment. Hierarchical clustering analysis was used to determine whether the response of LPS stimulation was similar to the patterns detected in RUNX1 inhibitor treatment, and if any differences were observed depending on the dosage of LPS and RUNX1 inhibitor used. As shown in Fig. 7, the genes were grouped into two large clusters. In the upper part of the cluster (from CLEC5A to SLA) the expression patterns of samples with RUNX1 inhibitor treatment, RUNX1 inhibitor plus LPS treatment, and non-simulated controls tend to be similar.  affect the samples. The expression patterns of RUNX1 inhibitor plus LPS treatment samples tend to be similar to controls and RUNX1 treatment only samples because of the overlap in the heatmap. In the lower part of the cluster (from ARHGDIB to RGS18), the expression patterns of some of the samples following RUNX1 inhibitor treatment, RUNX1 inhibitor plus LPS treatment, and non-simulated controls tend to be similar with the LPS treated samples. The reason for this discrepancy could be due to variation in preparing the PBMCs and the stimulation process which are difficult parameters to control.

Discussion
This study created transcriptomic datasets by Super deepSAGE using a large number of samples and a large number of biological repeats for the same tissue, and detected a large number of transcripts. To our knowledge, our work represents the largest tissue dataset profiled in a single study. We first analyzed the homogeneity of transcriptional information within samples of the same organ using PCA. Without prior information, we showed that samples from the same organ but different donors clustered together. This result is remarkable given the potential variability that could have been introduced for each sample with respect to the different animals. To analyze of similarities and differences between the expression profiles for different tissues, we used hierarchical clustering analysis. The dendrogram generated by this analysis reflected functional relationships between tissues. Our results also showed that gene expression profiling can distinguish each of the tissues. The data demonstrated that tissue-differentially expressed genes can be distinguished to gene clusters which share similar expression profiles and those genes were coregulated by common regulator genes. We found that genes in gene clusters were always expressed in tissue-specific manner. These findings are consistent with and strengthen the rationale for transcription rules for mammalian that transcription of genes turned on when and only when they were required. The transcription factors interact with the DNA recognition motifs, and regulate transcription of a large number of genes, and plays an important role in the regulation of tissue-differentially expressed genes. To interpret and understand gene regulation from the Super deep-SAGE data obtained in this study, identifying the overrepresented or under-represented motifs in the sequence showing similar expression patterns and factors binding to them, is necessary. Over-represented motifs were identified to play a regulatory role in the sequences, while underrepresentation indicated that the motif would have a harmful dis-regulatory effect.
RNA-seq can easily provide a much larger yield, have a large dynamic range and identify a larger number of genes and transcripts. However, literature shows limited number of studies using RNA-seq technology have been accomplished that run in parallel for a wide range of porcine tissues. The Super deepSAGE reduced cost by multiplexing and obtained data with good quality in terms of sequencing depth, gene coverage, and reproducibility. Few discrepancies were observed when comparing Super deepSAGE data with published microarray data. We predict two possibilities that could cause such discrepancies: 1) the SAGE tag was derived from two or more different transcripts, which were differentially expressed in the samples tested, and 2) the microarray probe set can target two or more transcripts due to sequence similarity. For example, the transcripts from the same gene family will always produce the same SAGE tag (attributable to the lower resolution power of Super deepSAGE) and preferred to hybridize to the same microarray probe set (can be minimized by design probe sets in the non-conserved region). Regardless of the discrepancies, we conclude that Super deepSAGE data is overall compatible with the microarray data and provide reliable gene expression profiles.
Among the transcription factors identified for tissuedifferentially expressed genes, RUNX1 is a master regulator of hematopoiesis and plays a vital role in T and B cell development. RUNX1 is critical in induction of the immune cells, such as interleukin-2 (IL-2, [21], IL-3 [22], colony-stimulating factor 1 receptor (CSF1R, [23], CSF2 [24], and cluster of differentiation 4 (CD4), [25]. However, its roles in LPS-mediated inflammation in PBMCs remains unclear. In this study, regulations of TLR-2, LCK, and VAV1 have been confirmed by flow cytometry. TLR2 is an essential receptor for the recognition of a variety of pathogen-associated molecular patterns (PAMPs) from Gram-positive bacteria, including bacterial lipoproteins, lipomannan, and lipoteichoic acids [26]. LCK encoded protein is a key signaling molecule in the selection and maturation of developing T-cells [27]. The VAV1 encoded protein is important in hematopoiesis, playing a role in both T-cell and B-cell development and activation [28,29]. These results suggest that RUNX1 might be a new potential target for resolving inadequate or uncontrolled inflammation in PBMCs.

Conclusions
Gene expression analysis is extensively applied in the understanding of the molecular mechanisms underlying a wide range of biological processes such as hostpathogen interactions. Our dataset (of transcript levels in tissues) can serve as a reference dataset for comparison of expression analysis to detect aberrations in transcript levels of various biological functions. Therefore, the major focus of this manuscript was to demonstrate the biological importance of these profiles. We report that > 40% of the measured transcripts were differentially expressed between different tissues. We show that statistically the transcripts were co-regulated by a few important transcription factors. Our study led to the identification of key transcription factors that regulate gene expression in PBMCs. This data will improve the annotation of the pig genome, support biological studies and increase the utility of the pig as a meat source and model in medical research.

Development of super deepSAGE technology
A flowchart of the Super deepSAGE experiment is summarized in Fig. 8. Dynabeads® M − 270 Amine (Thermo Fisher Scientific, China) were coupled with -C6-SH labeled reverse transcription-primer with the sequence containing the 5′-CAGCAG-3′ recognition site of EcoP15I and an Oligo (dT) sequence at the 3′ end, intentionally designed to complement the poly(A) sequence of mRNAs (Synthesized by Sangon Biotech, China). The coupling procedure was carried out as outlined in the protocol reported by Hill and Mirkin [30] using the succinimidyl 4-(p-maleimidophenyl) butyrate (SMPB) crosslink reagent (Thermo scientific, Shanghai, China). Ten micrograms of mRNA were reversetranscribed (cDNA synthesis system, Invitrogen) with the Oligo (dT) magnetic beads to generate single-stranded cDNA using the manufacturer protocol. The product was converted to double-stranded cDNA using random primers and then digested with NlaIII (NEB, Beijing, China). The biotin-labeled linkers (linker-5EA) with phosphorylated 5′ termini and 3′ end overhangs (5′-CATG-3′) contain the EcoP15I recognition site were prepared by annealing commercially synthesized oligonucleotides. The magnetic beads-bound cDNA was washed and bound to linker-5EA by T4 DNA ligase (NEB, Beijing, China). As a result, each cDNA fragment bound to the magnetic beads was flanked by two inverted repeats of EcoP15I recognizing sites. The type III restriction enzyme EcoP15I cleaves the DNA downstream of the recognizing site (25 nt in one strand and 27 nt in the other strand) leaving a 5′ end overhang of two bases [31,32]. Linker-ligated cDNAs on the magnetic beads were digested with ten units of EcoP15I under conditions described previously [33]. The supernatant containing biotin-labeled fragments were added to streptavidin magnetic beads (Promega, Beijing, China), and the biotinlabeled fragments of the cDNA were captured. Finally, barcoded linkers (linker-3EA) with two random base overhangs at 5′ end and phosphorylated termini were prepared and ligated to the cDNA ends by T4 DNA ligase (NEB, Beijing, China). The resulting products were amplified by polymerase chain reaction (PCR), and the 119 bp product was separated by polyacrylamide gel electrophoresis (PAGE) and recovered from the gel. The barcoded libraries prepared from different samples were combined into a Fig. 8 Flowchart of Super deepSAGE library construction. There are three major steps included in the protocol: a) reverse transcription with oligo (dT) coupled magnetic beads, synthesis of the secondary chain, and digestion with NlaIII; b) add 5′ end linker and digest with EcoP15I, and c) add 3′ end linker and PCR amplification. For details see materials and methods section single multiplex sequencing reaction at the end of library construction and submitted for deep sequencing. The sequence information of synthetic oligos, linkers, and primers are available in Supplemental document 4.
The serial analysis of gene expression (SAGE) was first developed by Velculescu et al. [34] and improved by Saha et al. [35], Matsumura et al. [33], and Nielsen et al. [36]. The traditional SAGE library construction protocol includes multiple steps, and the separation of the linkertag fragment is challenging to perform, and the PAGE purification often produces low yield. The library construction protocol in this study was improved by introducing two magnetic beads: 1) Dynabeads® M − 270 Amine coupled with -C6-SH labeled Oligo (dT) reverse transcription primer; 2) The streptavidin magnetic beads which can capture biotin-labeled linkers (linker-5EA). The magnetic beads used in this protocol can capture and purify the DNA fragments and is technically less demanding than PAGE separation. This modification increased the yield of linker-tag fragments and resulted in the robustness of the technique. Also, the primers and linkers were designed compatible with multiplexed deep sequencing technology, eliminating additional sequencing costs.

Animals, samples collection, and deep sequencing
Tissue samples were collected from a private slaughter farm (Jingzhou) located in the Hubei province in China. Following anesthesia by electric shock, specimens were excised, snap-frozen in liquid nitrogen, and kept in a deep freezer (− 80°C) until RNA extraction. The sample collection was approved by the Animal Care and Use Committee of Hubei Province (China, YZU-2018-0031). RNA isolation and purification from tissues and cells was done using the RNeasy Mini Kit (Qiagen, Shanghai, China) following the manufacturer's protocols. The BioAnalyzer 2100 (Agilent) was used to assess the integrity of total RNAs, and RIN number of less than 7.0 was eliminated from the study. After the experiment, the animals were slaughtered and sold. A total of 224 tissue samples across 28 different tissues were collected. The samples were collected from 32 animals from a Duroc × Landrace × Yorkshire (DLY) commercial crossbreed pig populations consisting of 16 males and 16 females at weaning dates of 21 days. The endometrium, placenta, and conceptus were collected from Landrace × Yorkshire (LY) sows of 12 days of gestation. Bone-marrow-derived macrophages were obtained by culturing bone marrow cells for 5-7 d in the presence of CSF-1 (0.5 ng/ml), as described previously for mouse macrophages [37]. Monocyte-derived macrophages were prepared according to a method published by Gao Y. et al. [38]. Monocyte-derived macrophages were cultured from CD14 + blood monocytes and polarization towards macrophage phenotypes was achieved by treating with M-CSF following a protocol published by Jaguin M. et al. [39]. The detailed sample information is summarized in Table 1 and Supplemental document 5. A total of eight samples (four male and four female) for each tissue were randomly selected form the 16 animals and submitted for deep sequencing. The computational extraction of tags from sequence data by the program (in-house designed) removes the two bases at the 5′ end. This 'digital removal' is performed to minimize the less accurate effect of two random bases, at the 5′ end of linker-3EA, and could potentially reduce the length of the tags, and affect the representative ability of the data. However, a direct link with a linker that has two random bases at the 5′ end forming overhangs will 1) enhance the efficiency of the link assay, and 2) no additional blunt ending process was needed to be compared with traditional SAGE method. These two bases were removed by the 'reads filtering' procedure, thereby lowering the systematic bias in the data.
Libraries were constructed using the Illumina TruSeq Small RNA Sample Prep Kit and were sequenced on an Illumina HiSeq2000 sequencer (50 bp and single-read) (Illumina Inc., USA) [40]. The sequencing data was filtered by a quality score (poor score < 0.5) for more than 20% of all the bases and then assessed using FastQC [41]. All the data discussed in this study have been deposited to the NCBI GEO database [42] under accession number GSE134461. Tag sequence was extracted, counted, and assigned for each transcript (Sus scrofa assembly 11.1, gene 99) using the SAGE software (modified to allow extraction of 19 bp tags), and then normalized for each sample by quantile normalization method [43]. For the tags assigned to multiple transcripts, the average copy numbers of those tags were used. The processed data is available to the reader in Supplemental document 5. The principal component analysis (PCA) was performed using the log 2 tag counts of transcripts across all samples using statistical analysis with R software version 3.5. The tissue-specific transcripts were identified by comparing samples from each tissue to the overall tag count across all samples, and a threshold was set to fold change > 5.0, p-value < 1.0 × 10 − 6 according to a method implemented in the limma package [44]. The limma mode used was "model.matrix(0 +factor(c (target_tissue, other_tissue)))". Clustering analysis was performed by first using the K-means clustering method to separate transcripts into big groups, and then using Hierarchical clustering to build the internal structure of the transcripts within the groups according to the method reported by Gu et al. [45,46]. Global-seeding procedures of BF98 [47] have been introduced into the Kmeans clustering algorithm to improve the consistency and quality of clustering results. The BF98 method employed a bootstrap-type procedure to determine the initial seeds for the centers. Several subsamples (recommended n = 10) of the dataset were clustered using K-means. Each clustering operation produced a different candidate set of centroids from which a new data set was constructed. This dataset was clustered using K-means, and the centroids were chosen as the initial seeds. The optimal BF98 clustering result on the Super deepSAGE data was obtained by "visualization" of the result performed by using K = 10 and the number of subsamples S = 1000 after trying K from 5 to 28 and S from 20 to 1000. The "visualization" method is straightforward for determining the best parameter for the K-means clustering procedure, but when the K reached 10, definite, compact and representative gene clustering was formulated, and when the S is higher than 200, consistent clustering results were produced for each duplicated clustering run.

Luciferase reporter assay
A 1Kb nucleotide promoter segment of TLR-2, LCK, and VAV1 that included RUNX1 target sites was inserted upstream of a firefly luciferase ORF (pGL3, Promega, Beijing, China), and luciferase activity was compared to that of an analogous reporter with point substitutions disrupting the target sites, or analogous reporter with the binding site deleted completely. The logic behind the luciferase reporter assay is that deletion/mutation of a RUNX1 binding site should allow the down-regulation of its target genes, and hence the target gene should be expressed differently between the wild type and mutated constructs. The pGL3-Control activity was used for the normalization of firefly luciferase activity. For the assay, the cells were plated in a 96-well plate at 3000 cells per well. After overnight incubation, the cells were treated with a transfection reagent mixture consisting of 35 μL of serum-free medium, 0.3 μL of TransFast™ Transfection Reagent (Cat. E2431), and 0.02 μg of pGL3 and pGL3-Control vector per well. After a one-hour incubation, 100 μL of the serum-containing medium was added to the wells. 24 to 48 h posttransfection, EnduRen™ Live Cell Substrate (Cat. E6481) was added at a final concentration of 60 μM, and luciferase activity was monitored.

PBMC isolation and stimulation
Peripheral blood mononuclear cells (PBMCs) were isolated from whole blood collected from five animals aged 21 days using BD Vacutainer VR Cell Preparation Tubes (Becton Dickinson, Shanghai, China). The samples were processed according to the manufacturer's instructions within two hours of blood collection. PBMCs were harvested from the tube, washed with phosphate-buffered saline (Life Technologies), and centrifuged for 10 min at 300 g before use. To induce gene expression, PBMCs were resuspended in RPMI-1640 medium (Life Technologies) supplemented with 10% fetal bovine serum (Life Technologies) at 1.5 × 10 6 cells/mL in a 96-well V-bottom polypropylene plate (Corning Incorporated). LPS (Sigma-Aldrich, Shanghai, China) and RUNX1 inhibitor (Ro 5-3335, R&D Systems, Shanghai, China) were added at 5 ng/mL and 10 ng/mL, respectively, according to the manufacturer's instructions. Untreated PBMCs were used as control samples.

Surface staining and cytometry acquisition
Phenotypic surface staining was performed in BD Phar-mingen™ stain buffer (BSA, BD Biosciences, Shanghai, China) for 30 min at room temperature in the dark, using anti-CD14 PE (BD Biosciences, Shanghai, China). Cells were washed and suspended in BD Pharmingen stain buffer (BSA, BD Biosciences, Shanghai, China), anti-TLR-2 FITC, anti-LCK FITC, anti-VAV1 FITC (BD Biosciences, Shanghai, China), was then added separately, and the mixture was incubated for 20 min at room temperature. Finally, cells were washed and acquired on a BD LSRFortessa™ cell analyzer (BD Biosciences, Shanghai, China). The flow cytometry data was deposited in the Flow Repository database [48] under accession number FR-FCM-Z268.