- Research article
- Open Access
Elucidating tissue specific genes using the Benford distribution
- Deepak Karthik†1,
- Gil Stelzer†1,
- Sivan Gershanov1,
- Danny Baranes1 and
- Mali Salmon-Divon1
© The Author(s). 2016
- Received: 3 March 2016
- Accepted: 7 July 2016
- Published: 9 August 2016
The RNA-seq technique is widely applied to investigate transcriptional behaviour. The reduction in sequencing costs has led to an unprecedented trove of gene expression data from diverse biological systems. Consequently, principles from other disciplines such as the Benford law, which can be properly judged only in data-rich systems, can now be examined on this high-throughput transcriptomic information. The Benford law states that in many count-rich datasets the distribution of the first significant digit is not uniform but rather logarithmic.
All tested digital gene expression datasets showed a Benford-like distribution when observing an entire gene set. This phenomenon is conserved across development and does not demonstrate tissue specificity. However, when obedience to the Benford law is calculated for individual expressed genes across thousands of cells, genes that best and least adhere to the Benford law are enriched with tissue specific or cell maintenance descriptors, respectively. Surprisingly, a positive correlation was found between the obedience a gene exhibits to the Benford law and its expression level, despite the former being calculated solely according to first digit frequency while totally ignoring the expression value itself. Nevertheless, genes with low expression that exhibit Benford behaviour demonstrate tissue specific associations. These observations were extended to predict the likelihood of tissue specificity based on Benford behaviour in a supervised learning approach.
These results demonstrate the applicability and potential predictability of the Benford law for gleaning biological insight from simple count data.
- Benford law
- Gene expression
RNA-seq is a very common application in biology for examining features of the transcriptome and global patterns of gene expression. The rapid development of massively parallel sequencing, or next-generation sequencing (NGS) [1, 2], together with the reduction in sequencing cost and the maturation of analytical tools, has made this application standard practice in molecular biology and medical studies. In recent years, a huge amount of RNA-seq data has accumulated in public biological databases, opening new opportunities for studying general patterns of gene expression in biological and medical systems. This copious data may now be examined using postulations that require vast amounts of information for their objective testing, such as the Benford law.
The Benford law, also known as the first digit law, contradicts the intuition that in any given series of numbers each of the nine digits would appear in the most significant (left-most) position with equal frequency. The Benford law states that in naturally occurring datasets the larger digits are less likely to occur in the first digit position. This law was discovered in 1881 by Newcomb, who examined tables of logarithms and noticed that the first pages were used more often than later pages, as indicated by fingerprint stains. In 1938, Frank Benford rediscovered this phenomenon and tested it on different types of count data, including the population sizes of different cities, river lengths, heat constants, atomic weights, electricity bills and many more. Today, the Benford law is used mainly for detecting fraudulent activity in accounting and tax data reports [5, 6]. The idea of using Benford’s law to screen data is based on the observation that regular, “naturally generated” data usually follow a logarithmic distribution, while faked data show abnormalities in the distribution.
Although the Benford law has been known for many years, its application in biological systems has barely been investigated. Benford’s law was found to be applicable to normal growth of human as well as bacterial populations [3, 8, 9]. Costas et al. found that the distribution of cell number per colony of the cyanobacterium Microcystis aeruginosa collected from different locations obeys the Benford law. Grandison et al. demonstrated that kinetic rate parameters of biological pathways follow the Benford law closely. Kreuzer et al. directly correlated changes in first digit distributions of EEG data with different states of anaesthesia. In the realm of genomics, it was shown that the number of ORFs in eukaryotes follows a Benford distribution, and Hoyle et al. showed that microarray spot intensities, which correlate with messenger RNA abundance, follow the Benford distribution. Generally, the first digit distribution can be used to monitor the consistency of the experimental process and data quality [14–17].
Here we tested whether digital gene expression data (RNA-seq), generated by NGS platforms that have become the obvious choice for expression experiments, adhere to the Benford distribution. In contrast to microarray data, RNA-seq technology reflects the actual count of RNA molecules rather than inferring expression from relative spot intensity. We examined if deviation from the Benford distribution is tissue specific or influenced by changes in gene expression occurring during development. In addition, we investigated whether genes belonging to various functional categories exhibit dissimilar Benford behaviour.
Available RNA-seq data
Raw fastq files of a mouse liver RNA-seq sample were provided by Zahavi et al. Adapters and low-quality bases were trimmed using Trim Galore, and reads were mapped to the mouse genome (build mm10) using TopHat2. The HTSeq-count script was used to count the reads mapping to each annotated mouse gene, generating a count table. The frequency of the most significant digit was calculated as described in the “Benford analysis” section below.
RNA-seq raw gene count datasets were downloaded from the ReCount resource. These include the Illumina Human BodyMap 2.0 data set [Gene Expression Omnibus accession code GSE30611], which consists of 16 human tissue types, and the transcriptome data of Drosophila melanogaster at different developmental stages. “Globally normalized” RNA expression (given in RPKM values) of human tissues from multiple donors was downloaded from the GTEx portal. Single-cell gene expression was obtained from the GEO portal. In these experiments, RNA isolated from 44,808 mouse retinal cells (GSE63472) and 11,149 mouse ES cells at various differentiation time points (GSE65525) was sequenced and profiled using the Drop-seq technology [25, 26]. The raw gene count tables were obtained from GEO and converted to counts per million (CPM) values prior to mean absolute error (MAE) calculation (see below).
Simulations for dissecting technical parameter effect
The raw data for this analysis originated from the ABRF SEQC study, which includes two sample types. The first is the Universal Human Reference RNA (740000, Agilent Technologies) and the second is the Ambion FirstChoice Human Brain Reference RNA (AM6000, Life Technologies). Both are well characterized standards that were used as part of the SEQC study by the US Food and Drug Administration (SEQC/MAQC-III Consortium). In contrast to the brain tissue sample, the universal human reference pools 10 human cell lines. Three paired-end 100 bp replicates were selected and downloaded (Gene Expression Omnibus accession GSE47792) for each sample type.
In order to simulate the effect of sample origin (cell lines vs. tissue), sequencing length, sequencing type (paired-end or single-end) and sequencing depth on the Benford behaviour, the following datasets were analysed: (1) the original 100 bp paired-end reads for both sample origin types; (2) 100 bp single-end reads for both sample origin types, in which case only the left reads were used; (3) single-end reads that were computationally trimmed to 50 bp; and (4) single-end reads that were computationally trimmed to 25 bp. In addition, instead of using all of the original paired-end reads, we randomly chose (5) 80 %, (6) 50 % and (7) 30 % of the sequences. For each simulation, adapter-trimmed (using Trim Galore) raw sequences were aligned to the hg38 genome assembly (UCSC) with the TopHat2 aligner version 2.0.1. The HTSeq-count script was used to generate counting tables describing the number of reads falling within each annotated gene. Unless specified otherwise, the Bioconductor edgeR package was used to calculate various expression metrics. The Benford test (see below) was applied to the following expression data: (1) raw counts; (2) counts per million mapped reads (CPM); (3) reads per kilobase of transcript per million mapped reads (RPKM); and (4) gene-based transcripts per million (TPM) values, calculated using an in-house R script.
In total, 168 matrices were computed (four gene expression calculation methods for 42 [three replicates of seven technical parameters tested for two sample origins: tissue vs. cell line] generated datasets).
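The three normalised expression metrics named above can be illustrated with a short Python sketch. It uses the textbook definitions of CPM, RPKM and TPM; it is not the edgeR implementation or the authors' in-house R script, and the counts and gene lengths are hypothetical values chosen only for illustration.

```python
def cpm(counts):
    """Counts per million mapped reads for one sample."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: normalise by length first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Hypothetical raw read counts and gene lengths (bp) for three genes
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]

print(cpm(counts))
print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

In this toy sample the three genes have the same read density per base, so RPKM and TPM come out equal across genes while CPM still reflects the raw count differences.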
Lists of housekeeping and tissue specific genes
A list of human housekeeping genes was obtained from Eisenberg et al. 2013. Tissue specific genes were obtained from the GeneCards database [30, 31]. Out of the 466 lung tissue specific genes, the 306 that had matching gene symbols in GTEx were used in downstream analysis. A similar number of housekeeping genes were randomly chosen out of the 3701 that were downloaded. Due to the lack of available mouse housekeeping and retina-specific gene lists, we used the human lists after converting the human gene symbols to their mouse orthologues. A list of 296 retina-specific genes was fetched from the GeneCards database, together with their homologous mouse gene symbols. The list of ~300 human housekeeping genes used above was converted to mouse gene symbols using the Ensembl BioMart tool.
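Benford analysis
The mean absolute error (MAE), $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert A_i - E_i \rvert$, was used to measure the amount of deviation from the Benford distribution, where $A_i$ is the observed frequency of first digit $i$, $E_i$ is the expected frequency predicted by the Benford distribution, $E_i = \log_{10}(1 + 1/i)$, and $n$ equals 9.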
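A minimal Python sketch of this deviation measure follows. It assumes that first-digit frequencies are expressed as proportions, that the Benford expectation for digit d is log10(1 + 1/d), and that zero values are skipped because they have no first significant digit. The function names are illustrative and are not taken from the authors' code.

```python
import math
from collections import Counter

# Expected Benford proportion for first digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """Most significant (left-most non-zero) digit of a positive number."""
    if x <= 0:
        raise ValueError("first digit is only defined here for positive values")
    # Scientific notation puts the leading digit first, e.g. '4.2...e-03'
    return int(f"{x:.16e}"[0])


def first_digit_freqs(values):
    """Observed proportion of each first digit 1..9 among positive values."""
    digits = [first_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to analyse")
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(1, 10)]


def benford_mae(values):
    """Mean absolute error between observed and Benford first-digit proportions."""
    observed = first_digit_freqs(values)
    return sum(abs(a - e) for a, e in zip(observed, BENFORD)) / 9


# Powers of two are a classic Benford-like sequence; the digits 1..9 taken
# once each give a uniform first-digit distribution for comparison.
print(benford_mae([2 ** k for k in range(200)]))
print(benford_mae(range(1, 10)))
```

A lower MAE indicates closer adherence to the Benford distribution, so the first printed value should be smaller than the second.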
Quantile normalized lung gene expression data (given in RPKM values) from 133 individuals originating from the GTEx database was analysed for a subset of genes belonging to either tissue-specific, housekeeping or random categories (approximately 300 genes of each). The mean absolute error (MAE) from the Benford distribution was calculated in two ways. In the individual-centric mode, the MAE was calculated for every gene category in each sample (individual) such that three MAE values were generated per individual for either a tissue specific, housekeeping or random gene set. The distribution of these values across individuals was then plotted for each gene category. In the gene-centric mode, the MAE was calculated across individuals for every single gene included in the different gene categories. The distribution of these MAE values within each category was plotted.
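The difference between the two modes can be sketched in Python as a column-wise versus row-wise calculation over a genes-by-individuals matrix. The matrix below is a hypothetical toy example, and the compact `mae` helper assumes proportions and the log10(1 + 1/d) Benford expectation; it is an illustration rather than the authors' implementation.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def mae(values):
    """MAE between first-digit proportions of the positive values and Benford."""
    digits = [int(f"{v:.16e}"[0]) for v in values if v > 0]
    if not digits:
        return float("nan")  # no expressed values, so no first digits
    return sum(abs(digits.count(d) / len(digits) - BENFORD[d - 1])
               for d in range(1, 10)) / 9


def individual_centric(matrix):
    """One MAE per individual (column), pooling all genes of the category."""
    return [mae(column) for column in zip(*matrix)]


def gene_centric(matrix):
    """One MAE per gene (row), pooling all individuals."""
    return [mae(row) for row in matrix]


# Hypothetical RPKM values: rows are genes of one category, columns are individuals
expr = [
    [1.2, 13.0, 2.5, 110.0],
    [3.1, 1.9, 17.0, 1.1],
    [25.0, 4.4, 1.6, 9.8],
]

print(individual_centric(expr))  # four values, one per individual
print(gene_centric(expr))        # three values, one per gene
```

With real data, the individual-centric values would be plotted per gene category across individuals, and the gene-centric values would be plotted per category across genes.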
In the retina single-cell analysis, genes were defined as expressed if their mean CPM (counts per million mapped reads) values calculated across all cells were in the top 40 %. Since genes that are not expressed inherently deviate from the Benford law, we pre-filtered for expressed genes prior to ranking them according to MAE scores. Subsequently, genes were ranked based on their MAE values and up to 300 top and bottom genes were selected. The genes with the highest and lowest MAE scores were analysed for enriched GO terms and tissues using GeneAnalytics. In the analysis of genes exhibiting both a low MAE score and a low expression level, we selected 321 genes having mean log2CPM < 5 out of the 600 genes tested above. These genes were sorted by their MAE score, and the top and bottom genes were analysed using GeneAnalytics. Top genes were selected as having an MAE < 0.065 (according to the MAE distribution plot of Fig. 6c in the Results section), and a similar number of genes (25) were selected from the bottom of the list (genes having the highest MAE scores). These genes were subjected to GeneAnalytics “Tissue and Cells” analysis (based on manually curated article information as well as high throughput comparisons).
In the analysis of differentiating individual mouse ES cells, MAE scores were calculated for every expressed gene across approximately a thousand cells at different time points (0 days representing pluripotent ES cells and 7 days representing differentiating cells) following leukaemia inhibitory factor (LIF) withdrawal. Expressed genes were defined as for the retina analysis. Genes having an expression level of log2CPM > 8 at day 0 were selected. This group of genes was divided into two subgroups: one contains all genes having an MAE score greater than 0.04, and the other contains the remaining genes. These gene lists were subjected to descriptor enrichment analysis using GeneAnalytics.
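The filter-then-rank step for the retina data can be sketched as follows. The sketch takes per-gene mean CPM values and precomputed MAE scores as plain dictionaries; the function name, its parameters and the toy gene identifiers are hypothetical, and the handling of ties or of gene lists shorter than 600 is an assumption rather than something stated in the text.

```python
def select_extreme_genes(mean_cpm, mae_score, expressed_fraction=0.40, n_extreme=300):
    """Keep the top fraction of genes by mean CPM, then return the genes with
    the lowest and the highest MAE scores among them.

    mean_cpm and mae_score map gene identifiers to numbers.
    """
    by_expression = sorted(mean_cpm, key=mean_cpm.get, reverse=True)
    n_keep = int(len(by_expression) * expressed_fraction)
    expressed = by_expression[:n_keep]

    ranked = sorted(expressed, key=lambda gene: mae_score[gene])
    # Never let the two tails overlap when few genes pass the filter
    n = min(n_extreme, len(ranked) // 2)
    lowest = ranked[:n]
    highest = ranked[len(ranked) - n:]
    return lowest, highest


# Hypothetical toy input: ten genes, g9 being the most highly expressed
mean_cpm = {f"g{i}": float(i) for i in range(10)}
mae_score = {"g0": 0.20, "g1": 0.01, "g2": 0.15, "g3": 0.03, "g4": 0.12,
             "g5": 0.07, "g6": 0.08, "g7": 0.05, "g8": 0.10, "g9": 0.02}

lowest, highest = select_extreme_genes(mean_cpm, mae_score, n_extreme=1)
print(lowest, highest)
```

Here only g6 to g9 pass the expression filter, so g1 is not reported despite having the lowest MAE overall, which mirrors the reason for pre-filtering given above.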
Multidimensional scaling classification
Gene-centric MAE values calculated for every gene across lung patients, as well as the first digit frequencies calculated per gene, were used as input for Multidimensional Scaling (MDS) analysis as well as a K Nearest Neighbours (KNN) test. MDS was performed using commands in the edgeR Bioconductor package. The 600 lung tissue specific and housekeeping genes were divided into training and test sets, with a proportion of 70:30 respectively. A KNN classification test using standard R functions implemented in the “class” package was performed with various k values (3, 5, 7, 9). Optimal results were observed with k = 7.
In order to determine whether a numerical dataset conforms to the Benford law, Pearson’s Chi-squared Goodness-of-Fit test was performed (see the R BenfordTests package for more details). The null hypothesis is that the population’s first digit distribution conforms to Benford’s law; hence a distribution having a p-value > 0.05 is considered to adhere to the Benford distribution. Comparisons between distributions were done using the Mann–Whitney U test.
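The classification step can be illustrated with a plain-Python k-nearest-neighbours sketch. This is a stand-in for the R “class” package, not a reproduction of it. The data are synthetic: each gene is represented by a single hypothetical MAE feature, and the class means (0.03 and 0.09) are invented for illustration rather than taken from the study, whereas the real analysis also used the per-gene first digit frequencies.

```python
import math
import random


def knn_predict(train_x, train_y, query, k=7):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def split_70_30(items, seed=0):
    """Shuffle reproducibly and split into 70 % training and 30 % test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Synthetic, well separated toy data: (feature vector, label) pairs
rng = random.Random(42)
genes = ([([rng.gauss(0.03, 0.005)], "tissue_specific") for _ in range(20)]
         + [([rng.gauss(0.09, 0.005)], "housekeeping") for _ in range(20)])

train, test = split_70_30(genes)
train_x = [x for x, _ in train]
train_y = [y for _, y in train]

predicted = [knn_predict(train_x, train_y, x, k=7) for x, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predicted, test)) / len(test)
print(accuracy)
```

An odd k avoids tied votes in a two-class problem, which is one practical reason to compare values such as 3, 5, 7 and 9.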
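The goodness-of-fit test can be sketched in Python without external libraries. The sketch computes Pearson's statistic over the nine first-digit classes and obtains the p-value from the closed-form survival function of the chi-squared distribution with 8 degrees of freedom. It illustrates the test itself and is not the BenfordTests implementation; the function name and the example data are assumptions.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def chisq_benford_test(values):
    """Pearson chi-squared goodness-of-fit of first digits to Benford's law.

    Returns (statistic, p_value) with 8 degrees of freedom.
    """
    digits = [int(f"{v:.16e}"[0]) for v in values if v > 0]
    n = len(digits)
    if n == 0:
        raise ValueError("no positive values to analyse")
    statistic = sum((digits.count(d) - n * p) ** 2 / (n * p)
                    for d, p in zip(range(1, 10), BENFORD))
    # For an even number of degrees of freedom 2m, the survival function is
    # exp(-x/2) * sum_{j<m} (x/2)^j / j!; here m = 4.
    half = statistic / 2
    p_value = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(4))
    return statistic, p_value


benford_like = [2 ** k for k in range(300)]   # powers of two
uniform_like = list(range(1, 10)) * 100       # every first digit equally often

print(chisq_benford_test(benford_like))
print(chisq_benford_test(uniform_like))
```

Under the decision rule above, the first dataset would be considered to adhere to the Benford distribution (p-value > 0.05) and the second would not.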
Benford distribution in digital expression data
Benford law adherence in gene categories
Benford and single-cell transcriptome
We subsequently tested the expression levels of the highest and lowest MAE scoring genes (Fig. 6c). In general, we observed a positive correlation between adherence to Benford and expression level. The lowest MAE scoring (most Benford-adherent) genes exhibit significantly higher expression levels, with a wider distribution, than their highest MAE scoring counterparts (Fig. 6d).
Since gene ontology analysis tests for an enrichment rather than exclusiveness of biological terms in a list of genes, one could argue that the observation above, in which Benford-adherent genes have tissue specific roles, relies on those genes in the list that are highly expressed in the tissue. In an attempt to address this issue, we tested whether the tissue specificity of genes residing on the lower tail of the expression distribution (where the blue and red curves overlap in Fig. 6d) can be distinguished based only on their adherence to Benford. We found that 19 out of 25 (~76 %) genes with low expression levels that adhere to the Benford law were determined as associated with the eye tissue. These genes include ADAMTS1, which was suggested to be involved in the inhibition mechanism of retinal neovascularization, and connexin43 (GJA1), which is the major connexin protein of astrocytes in the mammalian retina [43, 44]. In contrast, only four out of 25 (~16 %) of the high MAE scoring counterparts have any association with the eye, and these genes share biological terms that are inherent in the normal metabolism of every tissue in the body, such as translational processes (initiation, elongation and termination), “nuclear-transcribed mRNA catabolic processes” and “cellular protein metabolic processes”.
Benford in development
Benford predicting power
KNN test investigating the predictive power of the Benford law (K nearest neighbours test, K = 7)
Most of the scientific literature regarding the Benford law deals mainly with its uses in the financial field, for example its application in detecting fraudulent financial reports. In life sciences, however, there is scant information regarding the uses of the Benford law in biological data systems, and even less information on genomics applications. High throughput technologies provide thousands of measurements from a single biological sample, which present a tremendous source of count data against which to test Benford's law. These include gene expression counts across many individuals and, more recently, single cell measurements, which allow testing of heterogeneity in the nature of gene expression across single cells. Here we report that digital gene expression follows the Benford distribution in a wide range of biological tissues and developmental conditions. Although read length and coverage highly influence the ability to quantify differential gene expression [47, 48], they have a negligible impact on the Benford behaviour of gene expression data.
In general, numerical data that follow the Benford distribution usually have a logarithmic nature. This is, therefore, the underlying explanation for why digital gene expression data, which are lognormally distributed, obey the Benford law [49, 50]. This rationale may also explain the suggestion of Hoyle et al. that adherence of gene expression to the Benford law is not species specific. Indeed, our findings that gene expression data originating from mouse (Fig. 1), human (Fig. 3) or Drosophila (Fig. 7) follow the Benford distribution indicate that this principle is conserved across metazoans, and may be extended to additional clades in the tree of life as long as the logarithmic nature of their expression data is preserved. Although the lognormal distribution of expression levels reflects true biological variability and is not an artefact of the technology, we still cannot rule out that the exponential PCR amplification performed during library preparation contributes to the Benford behaviour of gene expression. Therefore, the Benford distribution could be tested on PCR-free expression data, such as those generated by the Nanostring technology, once these are produced on a whole-genome scale.
In order to investigate whether biological insight could be gleaned through examination of first digit frequencies, we explored these distributions in different gene sets having unique characteristics, such as tissue specific and housekeeping genes, rather than scrutinizing the whole gene list. As previously described, tissue specific genes are expressed in fewer conditions than housekeeping genes. However, looking at a single condition, one tissue sample for example, the dynamic range of expression for genes that were previously determined as tissue specific was much wider than that observed for housekeeping genes. Our finding that housekeeping genes violate Benford's law, compared with tissue specific genes, is a reflection of their narrow expression distribution. Repeating this analysis across 133 samples of the same tissue produced the same distribution. This process was also repeated in an additional two GTEx-derived whole-tissue homogenates as well as retina single-cell data, exhibiting similar results.
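The link between a wide, logarithmic spread of values and Benford behaviour can be illustrated with a small simulation. The sketch draws lognormal samples with a wide and with a narrow spread and compares their MAE from the Benford distribution. The distribution parameters are arbitrary choices for illustration and are not fitted to any expression dataset.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def mae(values):
    """MAE between first-digit proportions of the positive values and Benford."""
    digits = [int(f"{v:.16e}"[0]) for v in values if v > 0]
    return sum(abs(digits.count(d) / len(digits) - BENFORD[d - 1])
               for d in range(1, 10)) / 9


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Same location, different spread on the log scale (arbitrary parameters)
wide = [rng.lognormvariate(5.0, 3.0) for _ in range(20000)]
narrow = [rng.lognormvariate(5.0, 0.1) for _ in range(20000)]

print(mae(wide))    # spans many orders of magnitude
print(mae(narrow))  # almost all values share the same leading digit
```

The wide sample spans several orders of magnitude and lands close to the Benford proportions, whereas the narrow sample concentrates on one leading digit and deviates strongly, which parallels the contrast drawn in this discussion between broad and restricted expression ranges.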
The observed restricted expression range of housekeeping genes can be explained by the fact that housekeeping genes do not map to random locations throughout the human genome, but instead resolve to clusters [53, 54]. This may subject the clustered genes to the same transcriptional control, leading to a narrow expression range. In contrast to housekeeping genes, tissue-specific genes exhibit a wide expression dynamic range which explains their Benford behaviour. This wide range is surprising in itself since one would expect tissue specific genes, which are defined as genes whose expression is vital to the normal metabolism of the tissue, to demonstrate a narrow distribution of high expression level. Our data suggest that tissue specificity and expression distribution (within a single condition/tissue) are orthogonal characteristics of genes.
It is recommended to analyse large datasets (>1000 observations) in order to discern Benford tendencies. This requirement can be easily met by observing the expression of many genes in a single tissue RNA sample. However, in order to analyse the Benford distribution of a single gene, the recommended sample size is around a thousand samples, which is not practical for most RNA-seq experiments.
The advantage of high throughput single-cell sequencing technologies is the possibility to dissect the expression of a single gene across a vast number of samples. We harnessed the availability of two highly parallel single-cell expression profiling datasets, available for mouse retina and ES cells, to rank individual genes according to their closeness to the expected Benford distribution. Once this ranking was available, we could inspect whether it is biologically meaningful. It is unexpected that genes selected based only on their Benford distribution property, while completely ignoring their expression value, would share unique biological characteristics. Surprisingly, we found that genes exhibiting the Benford pattern are more likely to have a functional role within the tissue in question, and are likely to be highly expressed. Furthermore, we observed that Benford-adherent genes with low expression levels tend to have tissue oriented functionality rather than the basic maintenance functions (translation and transcription processes) that characterise their Benford-divergent counterparts. Therefore, genes that were overlooked for roles in tissue functionality due to their lower expression level should now be re-evaluated for this capacity based on their Benford behaviour. This could be achieved by overexpressing them or completely abolishing their expression, and then examining the resulting phenotype in the tissue or cell line in question, where they are predicted to have specific roles.
Two approaches were taken in this study in order to test the capacity of the Benford law to predict tissue specificity. The first tests gene ontology enrichment of genes that were selected based on their MAE score only, without assuming anything about their nature. When we used this approach on retina single cell data from thousands of cells, we indeed found that genes that adhere to the Benford law tend to have tissue specific roles. This phenomenon could not be observed in GTEx tissue expression levels, probably because the number of samples is lower than is optimal for Benford analysis. Once additional high-throughput single cell data become available, this observation could be verified in other tissues as well. The other approach uses a priori characterised tissue specific and housekeeping gene sets, testing the structure of these datasets by visualizing the relative distance between the observations. Next, supervised machine learning quantified the feasibility of using the Benford law to predict the tissue specific tendency of an unknown gene. The latter was successfully applied to GTEx data despite its relatively small number of samples (133 in the lung tissue dataset).
The applicability of the Benford distribution in biological datasets has not yet been fully realized. To the best of our knowledge, there are no previous reports in the literature showing that RNA-seq digital expression data follow the Benford distribution. Furthermore, this paper introduces the novelty of relating adherence to the Benford law to gene sets with unique characteristics, such as tissue specificity. Importantly, we demonstrated the application of Benford adherence for testing the likelihood of genes having a general housekeeping role versus a unique role in the examined tissue. To summarize, despite its simplicity, adherence to the Benford law is an elegant and robust means to classify genes while totally ignoring their expression level and any other gene characteristic.
CPM, counts per million mapped reads; MAE, mean absolute error; NGS, next generation sequencing
The Levi-Eshkol Fund, Ministry of Science, Technology & Space, Israel [grant number 3-12624] (SG). The funding body had no role in the design of the study, in the collection, analysis or interpretation of data, or in writing the manuscript.
Availability of data and materials
The datasets analyzed during the current study are available in the ReCount resource: http://bowtie-bio.sourceforge.net/recount/, the GTEx portal http://www.gtexportal.org/home/, and the GEO repository accession numbers GSE63472, GSE65525, GSE47792.
MS-D conceived and coordinated the study; DK, GS, SG and MS-D analyzed the data with advice from DB; GS, and MS-D drafted the manuscript which was approved by all authors.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.PubMedPubMed CentralGoogle Scholar
- Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46.View ArticlePubMedGoogle Scholar
- Benford F. The law of anomalous numbers on JSTOR. Proc Am Philos Soc. 1938;78:551–72.Google Scholar
- Newcomb S. Note on the frequency of use of the different digits in natural numbers on JSTOR. Am J Math. 1881;4:39–40.View ArticleGoogle Scholar
- Nigrini MJ. Digital Analysis Using Benford’s Law. Vancouver: Global Audit Publications; 2000.Google Scholar
- Durtschi C, William Hillison CP. The effective use of Benford’s law to assist in detecting fraud in accounting data. J Forensic Account. 2004;V:17–34.Google Scholar
- Hill TP. The difficulty of faking data. Chance. 1999;12:27–31.View ArticleGoogle Scholar
- Sandron F. Do populations conform to the law of anomalous numbers? Population (Paris). 2002;57:755–61.View ArticleGoogle Scholar
- Costas E, López-Rodas V, Toro FJ, Flores-Moya A. The number of cells in colonies of the cyanobacterium Microcystis aeruginosa satisfies Benford’s law. Aquat Bot. 2008;89:341–3.View ArticleGoogle Scholar
- Grandison S, Morris RJ. Biological pathway kinetic rate constants are scale-invariant. Bioinformatics. 2008;24:741–3.View ArticlePubMedGoogle Scholar
- Kreuzer M, Jordan D, Antkowiak B, Drexler B, Kochs EF, Schneider G. Brain electrical activity obeys Benford’s law. Anesth Analg. 2014;118:183–91.View ArticlePubMedGoogle Scholar
- Friar JL, Goldman T, Pérez-Mercader J. Genome sizes and the Benford distribution. PLoS One. 2012;7, e36624.View ArticlePubMedPubMed CentralGoogle Scholar
- Hoyle DC, Rattray M, Jupp R, Brass A. Making sense of microarray data distributions. Bioinformatics. 2002;18:576–84.View ArticlePubMedGoogle Scholar
- Docampo S, del Mar TM, Jesu´s Aira M, Cabezudo B, Flores-Moya A. Benford’s law applied to aerobiological data and its potential as a quality control too. Aerobiologia (Bologna). 2009;25:275–83.View ArticleGoogle Scholar
- Miller SJ. Benford’s Law: Theory and Applications. 2015.View ArticleGoogle Scholar
- Orita M, Moritomo A, Niimi T, Ohno K. Use of Benford’s law in drug discovery data. Drug Discov Today. 2010;15:328–31.View ArticlePubMedGoogle Scholar
- Orita M, Hagiwara Y, Moritomo A, Tsunoyama K, Watanabe T, Ohno K. Agreement of drug discovery data with Benford’s law. Expert Opin Drug Discov. 2013;8:1–5.View ArticlePubMedGoogle Scholar
- Zahavi T, Lanton T, Divon MS, Salmon A, Peretz T, Galun E, Axelrod JH, Sonnenblick A. Sorafenib treatment during partial hepatectomy reduces tumorgenesis in an inflammation-associated liver cancer model. Oncotarget. 2016;7:4860–70.PubMedGoogle Scholar
- Trim Galore. [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/].
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36.View ArticlePubMedPubMed CentralGoogle Scholar
- Anders S, Pyl PT, Huber W. HTSeq - A Python framework to work with high-throughput sequencing data. Bioinformatics. 2014.Google Scholar
- Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011;12:449.View ArticlePubMedPubMed CentralGoogle Scholar
- Graveley BR, Brooks AN, Carlson JW, Duff MO, Landolin JM, Yang L, Artieri CG, van Baren MJ, Boley N, Booth BW, Brown JB, Cherbas L, Davis CA, Dobin A, Li R, Lin W, Malone JH, Mattiuzzo NR, Miller D, Sturgill D, Tuch BB, Zaleski C, Zhang D, Blanchette M, Dudoit S, Eads B, Green RE, Hammonds A, Jiang L, Kapranov P, et al. The developmental transcriptome of Drosophila melanogaster. Nature. 2011;471:473–9.View ArticlePubMedGoogle Scholar
- Keen JC, Moore HM. The Genotype-Tissue Expression (GTEx) project: linking clinical data with molecular analysis to advance personalized medicine. J Pers Med. 2015;5:22–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–14.View ArticlePubMedPubMed CentralGoogle Scholar
- Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161:1187–201.View ArticlePubMedPubMed CentralGoogle Scholar
- SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32:903–14.View ArticleGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.View ArticlePubMedGoogle Scholar
- Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet. 2013;29:569–74.View ArticlePubMedGoogle Scholar
- Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. GeneCards: integrating information about genes, proteins and diseases. Trends Genet. 1997;13:163.View ArticlePubMedGoogle Scholar
- Fishilevich S, Zimmerman S, Kohn A, Iny Stein T, Olender T, Kolker E, Safran M, Lancet D. Genic insights from integrated human proteomics in GeneCards. Database (Oxford). 2016;2016.Google Scholar
- Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt SE, Janacek SH, Johnson N, Juettemann T, Keenan S, Lavidas I, Martin FJ, Maurel T, McLaren W, Murphy DN, Nag R, Nuhn M, Parker A, Patricio M, Pignatelli M, Rahtz M, Riat HS, et al. Ensembl 2016. Nucleic Acids Res. 2015;44:D710–6.
- Joenssen DW. BenfordTests: Statistical Tests for Evaluating Conformity to Benford’s Law. 2013.
- Ramsköld D, Wang ET, Burge CB, Sandberg R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol. 2009;5:e1000598.
- Ben-Ari Fuchs S, Lieder I, Stelzer G, Mazor Y, Buzhor E, Kaplan S, Bogoch Y, Plaschkes I, Shitrit A, Rappaport N. GeneAnalytics: An integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. Omics. 2016;20:139–51.
- Venables WN, Ripley BD. Modern Applied Statistics with S. Fourth Edition. New York: Springer; 2002. ISBN 0-387-95457-0. https://cran.r-project.org/web/packages/class/citation.html.
- Butte AJ, Dzau VJ, Glueck SB. Further defining housekeeping, or “maintenance”, genes. Focus on “A compendium of gene expression in normal human tissues”. Physiol Genomics. 2001;7:95–6.
- Delahaye J-P, Gauvrit N. Scatter and Regularity Imply Benford’s Law… More. 2011. HAL.
- Fewster RM. A simple explanation of Benford’s law. Am Stat. 2009;63:26–32.
- Saliba A-E, Westermann AJ, Gorski SA, Vogel J. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res. 2014;42:8845–60.
- Nakamura T, Yabuta Y, Okamoto I, Aramaki S, Yokobayashi S, Kurimoto K, Sekiguchi K, Nakagawa M, Yamamoto T, Saitou M. SC3-seq: a method for highly parallel and quantitative measurement of single-cell gene expression. Nucleic Acids Res. 2015;43:e60.
- Xu Z, Yu Y, Duh EJ. Vascular endothelial growth factor upregulates expression of ADAMTS1 in endothelial cells through protein kinase C signaling. Invest Ophthalmol Vis Sci. 2006;47:4059–66.
- Güldenagel M, Söhl G, Plum A, Traub O, Teubner B, Weiler R, Willecke K. Expression patterns of connexin genes in mouse retina. J Comp Neurol. 2000;425:193–201.
- Kerr NM, Johnson CS, de Souza CF, Chee K-S, Good WR, Green CR, Danesh-Meyer HV. Immunolocalization of gap junction protein connexin43 (GJA1) in the human retina and optic nerve. Invest Ophthalmol Vis Sci. 2010;51:4028–34.
- Tomancak P, Berman BP, Beaton A, Weiszmann R, Kwan E, Hartenstein V, Celniker SE, Rubin GM. Global analysis of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 2007;8:R145.
- White J, Dalton S. Cell cycle control of embryonic stem cells. Stem Cell Rev. 2005;1:131–8.
- Chhangawala S, Rudy G, Mason CE, Rosenfeld JA. The impact of read length on quantification of differentially expressed genes and splice junction detection. Genome Biol. 2015;16:131.
- Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNA-seq: a matter of depth. Genome Res. 2011;21:2213–23.
- Gierliński M, Cole C, Schofield P, Schurch NJ, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson G, Owen-Hughes T, Blaxter M, Barton GJ. Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics. 2015;31:3625–30.
- Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29.
- Bengtsson M, Ståhlberg A, Rorsman P, Kubista M. Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels. Genome Res. 2005;15:1388–92.
- Dezso Z, Nikolsky Y, Sviridov E, Shi W, Serebriyskaya T, Dosymbekov D, Bugrim A, Rakhmatulin E, Brennan RJ, Guryanov A, Li K, Blake J, Samaha RR, Nikolskaya T. A comprehensive functional analysis of tissue specificity of human gene expression. BMC Biol. 2008;6:49.
- Lercher MJ, Urrutia AO, Hurst LD. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat Genet. 2002;31:180–3.
- Pauli F, Liu Y, Kim YA, Chen P-J, Kim SK. Chromosomal clustering and GATA transcriptional regulation of intestine-expressed genes in C. elegans. Development. 2006;133:287–95.
- Singleton TW. Understanding and applying Benford’s law. ISACA. 2011;3:6–9.