Hierarchicell: an R-package for estimating power for tests of differential expression with single-cell data

Background Study design is a critical aspect of any experiment, and sample size calculations for statistical power that are consistent with that study design are central to robust and reproducible results. However, the existing power calculators for tests of differential expression in single-cell RNA-seq data focus on the total number of cells and not the number of independent experimental units, the true unit of interest for power. Thus, current methods grossly overestimate the power. Results Hierarchicell is the first single-cell power calculator to explicitly simulate and account for the hierarchical correlation structure (i.e., within sample correlation) that exists in single-cell RNA-seq data. Hierarchicell, an R-package available on GitHub, estimates the within sample correlation structure from real data to simulate hierarchical single-cell RNA-seq data and estimate power for tests of differential expression. This multi-stage approach models gene dropout rates, intra-individual dispersion, inter-individual variation, variable or fixed number of cells per individual, and the correlation among cells within an individual. Without modeling the within sample correlation structure and without properly accounting for the correlation in downstream analysis, we demonstrate that estimates of power are falsely inflated. Hierarchicell can be used to estimate power for binary and continuous phenotypes based on user-specified number of independent experimental units (e.g., individuals) and cells within the experimental unit. Conclusions Hierarchicell is a user-friendly R-package that provides accurate estimates of power for testing hypotheses of differential expression in single-cell RNA-seq data. This R-package represents an important addition to single-cell RNA analytic tools and will help researchers design experiments with appropriate and accurate power, increasing discovery and improving robustness and reproducibility. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07635-w.


Why mixed models?
When a dataset contains multiple levels (i.e., a hierarchy) the variability can be split into between group variability and within group variability. With such data, units at the highest level (i.e., individuals) are the only truly independent units and treating any of the units of subsequent levels (i.e., cells) as independent is statistically inappropriate. There is a long body of literature that demonstrates that applying statistical inference to replicates that are not statistically independent without properly accounting for their correlation structure will inflate type 1 error rates and lead to spurious results. As the denominator of most statistical tests is a function of the variance, not accounting for the positive correlation among sampling units underestimates the true standard error and leads to false positives. In addition, treating each replicate as independent inflates the test degrees of freedom, making it easier to falsely reject the null hypothesis.
Here, we implement mixed models as a means of handling the correlation structure in single-cell RNA-seq data. A different, but commonly applied method that is a valid approach for handling the heirarchical nature of single-cell data, is aggregating the gene expression values from each of the cells within an individual and then computing the analysis on the aggregate values. Doing so, however, reduces the number of data points and is a loss of information which makes this a conservative approach. Overall, it has been demonstrated repeatedly in the literature that mixed models lead to the most accurate results when analyzing hierarchical data.
Mixed models are implemented as a means of accounting for correlated data by modeling the hierarchical structure of the data by partitioning out the between and within group variability. Mixed models are an extension of typical linear models that allow for both fixed and random effects. A fixed effect is a parameter that does not vary, but random effects are parameters that are themselves random variables. A fixed effect model (think typical linear regression) assumes the data are random variables but the parameters are fixed. With a random effect, however, the data are random variables and the parameters are random variables at the lower level, but fixed at the highest level. For example, if we were to model the test scores of students from six different schools based on some predictor variable, the data would have two levels: schools and students within those schools. The overall mean test scores across all students and all schools is fixed, but the parameter within each school is assumed to follow a random normal distribution with the same mean as the overall mean and some variance.

Why the two part hurdle model?
The two-part hurdle model is implemented in a tool called MAST (Finak et al., 2015) and it is used to simultaneously model the continuous (non-zero values) and discrete ("expressed"/"not expressed") components of single-cell data. This is an excellent model for identifying genes that are either: 1. expressed at varying magnitudes (e.g., average expression of 100 vs average expression of 1000) 2. expressed at different rates (e.g., 30% of cells expressing the gene being tested vs 60% of cells expressing the same gene) 3. some combination of both Other models are primarily testing the first hypothesis, but we believe the second hypothesis is just as meaningful and will allow users to capture more genes.

Installation and R setup
Before beginning, you will need to install the hierarchicell package This can be done by: install.packages() from CRAN devtools::install_github() from GitHub (this will download the package in development, so will offer latest developments, but may be unstable) To load the package simply type "library(hierarchicell)". This will attach the package and all of its related functions.It is important to install and load all of hierarchicell's dependencies as well.

Load and filter input data
After installing and loading the R-package, the initial steps will be to read in your data and format it properly. Then you have the option of filtering your data. By default, the program will only filter out cells and genes that contain all zero values. This is the default because, we believe that the proportion of zeros in your preliminary dataset is informative for downstream power calculations. If you expect your next set of data to contain fewer zeros or to be of higher quality, then filtering at this step may be recommended.
Data should be only for cells of the specific cell-type you are interested in simulating or computing power for. The input data will need to be a data.frame where the unique cell identifier is in column one and the sample identifier is in column two with the remaining columns all being genes. Data should be only for cells of the specific cell-type you are interested in simulating or computing power for. Data should also contain as many unique sample identifiers as possible. If you are inputing data that has less than 5 unique values for sample identifier (i.e., independent experimental units), then the empirical estimation of the interindividual heterogeneity is going to be very unstable. Finding such a dataset will be difficult at this time, but, over time (as experiments grow in sample size and the numbers of publicly available single-cell RNAseq datasets increase), this should improve dramatically.
If you do not have a preliminary dataset you would like to use, running the filter_counts function without specifying any options ("filter_counts()") will load a default dataset of pancreatic alpha cells. This a reasonable starting place for researchers looking to simply get an idea of power for their study.

Estimate and plot various parameters
After running the filter_counts function and ensuring your data is in the proper format, the next step involves estimation of the simulation parameters. The functions estimate the empirical distributions for dropout rate, global gene means, and they model the hierarchical variance structure of the input data. The default for these functions assumes the data are normalized somehow (i.e., "TPM", "RPM","FPKM"), however, raw data can be input as well. It will just need to be specified (type = "Raw").
Here, we will run the global compute_data_summaries function first. This is the only function you will need to run to continue on with next steps. In the latter steps we will use the individual functions within the compute_data_summaries function to plot distributions and visualize how well the program is modeling the behavior of the input data. This is recommended for you to ensure nothing extremely concerning is happening with your data.
NOTE: With some data types, the intra-individual variance is (more often than not) smaller than the intraindividual mean. This leads to negative dispersion estimates which end up being disregarded. We are working on developing more advanced simulation methods that take this into account and use a poisson or gaussian instead of a negative binomial where these scenarios exist.